High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). Native NVFP4/GGUF, 270 tok/s decode on Qwen3-Coder-30B MoE. Written entirely by Claude Code.
Measuring what makes a VLA fast enough to run on the robot: a 5.9x CUDA-graph win, four experiments on why low-bit precision doesn't deliver one, a budget-driven deploy-compiler, and a runtime safety supervisor. Live demo: hf.co/spaces/LaelaZ/embodied-efficiency
Prefill performance study on Qwen2.5-7B using vLLM. Compares static vs mixed (bucketed) prefill under eager execution and CUDA Graphs, with controlled concurrency and real-world latency/throughput metrics.