From-scratch CUDA inference engine for the NVIDIA RTX 5090 (sm_120a).
One architecture, fully exploited: +37–72% faster decode than llama.cpp on dense GGUF, and the
only engine that runs native NVFP4 on consumer Blackwell. ~97k lines, 100% written by Claude Code.
A C++20/CUDA inference engine that targets exactly one architecture: the NVIDIA RTX 5090 / RTX PRO 6000 (GB202, sm_120a). The build emits raw sm_120a SASS via direct gencode — no portability layer, no FP16 dequant fallback in the hot path, no wrapper around llama.cpp or vLLM. A compute_120f PTX fallback covers the other Blackwell consumer SKUs (RTX 5080 / 5070 Ti). imp ships its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet + Mamba2 scans, CUDA Graphs, and an OpenAI/Anthropic-compatible HTTP server — ~97k lines of C++/CUDA (plus ~50k lines of tooling and tests), every line generated by an AI coding agent.
The RTX 5090 shipped with native FP4 tensor cores, 32 GB GDDR7 at 1.8 TB/s, and a new ISA (sm_120a) that no existing inference engine fully exploits. llama.cpp targets broad hardware compatibility and runs GGUF through dequant-to-FP16 paths — fast everywhere, but leaving consumer-Blackwell-specific features (NVFP4 block-scaled mma.sync MMA, FP8 f8f6f4 scores) on the table. vLLM targets datacenter Blackwell (B200/B300, sm_100) and gates key backends like FlashInfer MoE on tcgen05 — an opcode family that consumer Blackwell (sm_120) doesn't have, so vLLM's NVFP4 path falls back to Marlin or fails outright on the 5090.
Consumer Blackwell is not a smaller datacenter Blackwell: sm_120a has no tcgen05, no TMEM, no wgmma, and no TMA warp-specialized grouped GEMM. imp is built against that reality — the FP4 path is mma.sync mxf4nvf4 with FlashAttention-2-style block-scaling, not the Hopper/B200 kernel designs.
That leaves a gap: 30B-class MoE models (Qwen3-Coder, Qwen3.6, Gemma-4) fit in 32 GB as NVFP4 prequant and should decode at 150–270 tok/s on this hardware, but neither engine delivers that today. imp exists to close that gap — a single-target engine that treats the RTX 5090 as the only GPU that matters and optimizes accordingly.
NVFP4 prequant (from NVIDIA Model Optimizer / llm-compressor) is the native weight format for Blackwell's FP4 tensor cores. imp loads these SafeTensors directly into its NVFP4 decode cache and CUTLASS v4.5.1 GEMM kernels with no re-quantization (there are no FP4 cuBLASLt kernels on sm_120, so CUTLASS is the primary GEMM path). For GGUF models, imp builds an NVFP4 decode cache that converts Q8_0/Q6_K weights to FP4 at init time, getting the bandwidth benefit of sub-byte weights on the decode hot path while keeping full-precision prefill via source dequant.
Single RTX 5090, greedy, CUDA 13.3, CUDA Graphs on. Headline numbers (decode, the reliable A/B signal — see BENCHMARKS.md for dated, commit-anchored measurements with exact commands):
- GGUF dense decode: Qwen3-8B Q8_0 at ~268 tok/s (CI-gated baseline) — +37–72% over llama.cpp with full offload and flash attention.
- NVFP4 SafeTensors decode: 30B-class MoE at ~245–307 tok/s
(Qwen3-30B/Coder-30B 307, Qwen3.6-35B 245, Gemma-4-26B 259) — effectively
uncontested on
sm_120, where vLLM's NVFP4 path needstcgen05and llama.cpp has no native NVFP4 support. - Honest losses: NVFP4 long-context prefill (~1.7× behind vLLM at pp4096, attention-bound; imp WINS below ~2k context), Qwen3.6-35B GGUF decode (−31%, structural FP16 GDN-projection tax).
Every number, with date, commit SHA, CUDA version, quant and the exact
command: BENCHMARKS.md. Methodology details:
docs/performance.md.
Use llama.cpp if you want a stable, mature engine with broad model/hardware support and CPU fallback. It runs everywhere and supports everything.
Use vLLM if you serve high-concurrency batched workloads on datacenter GPUs (H100/B200). It has continuous batching, PagedAttention at scale, and production-grade serving.
Use imp if you have Blackwell consumer/workstation hardware (RTX 5090, RTX PRO 6000), want native NVFP4/FP4 without dequant overhead, and don't mind running a prototype. imp is fastest-on-Blackwell for single-user decode, but it's experimental, single-GPU, and single-author.
- Single GPU only. No tensor parallelism, no multi-GPU.
- Consumer Blackwell only.
sm_120aSASS +compute_120fPTX fallback. No Hopper, Ada, Ampere, datacenter Blackwell. No AMD, Intel, Apple, or CPU paths. - GGUF prefill: largely fixed 2026-06-07. The INT8-IMMA prefill family (#612–#617) puts Qwen3-30B-A3B and Qwen3-14B-Q6_K AHEAD of llama.cpp; Q8_0 dense and gemma-4 sit at 1.20×, Qwen3.6-35B at 1.55× (GDN share is quality-locked). NVFP4 prefill trails vLLM ~1.7× at pp4096 only — imp WINS below ~2k context (see docs/audit/prefill_gap_2026_06_07.md).
- MoE/hybrid GGUF decode loses on Qwen3.6-35B (~−31% vs llama.cpp): an FP16-projection tax on the GDN/attention path that NVFP4 can't address.
- Only tested models work reliably. Anything not on the supported list may load but hasn't been verified.
- Prefill numbers are noisy. cuBLAS autotuning causes up to 2.6× variance across container restarts.
- Single-author, experimental. Don't deploy this anywhere it matters.
Everything runs in Docker — no local CUDA toolkit needed. Prebuilt images are on GHCR (built per release for x86-64 + sm_120a):
# Drop a GGUF or SafeTensors model into ./models/
mkdir -p models
# Run the server from the prebuilt image
docker run --gpus all -v ./models:/models -p 8080:8080 \
ghcr.io/kekzl/imp:latest --model /models/your-model.gguf
# Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'Or build from source (tracks main instead of the latest release):
git clone https://github.com/kekzl/imp.git && cd imp
docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
imp:latest --model /models/your-model.ggufCLI reference, server flags, config, and C API: docs/usage.md.
| Family | Variants | Quantizations |
|---|---|---|
| Qwen3 / Qwen3-MoE | dense + MoE (Coder-30B-A3B) | Q4_K_M, Q6_K, Q8_0, NVFP4 |
| Qwen3.5 / Qwen3.6 | GDN + attention (+ MoE) | Q4_K_M, Q8_0, NVFP4 |
| Gemma-4 | 26B-A4B MoE, 31B dense | Q4_K_M, Q5_K_M, Q8_0, NVFP4 |
| Gemma-3 | text + vision (SigLIP) | GGUF |
| Phi-4-reasoning-plus | dense, fused projections | NVFP4 |
| gpt-oss-20b | MoE (32 experts, top-4), Harmony | SafeTensors MXFP4 (native) |
| Nemotron-H | Mamba2 + Attention + MoE | NVFP4, GGUF |
| Llama / Mistral / DeepSeek | dense + MoE | GGUF (Q*_K, Q8_0) |
VRAM, decode tok/s, and per-model notes: docs/supported-models.md.
| Quantization | GGUF Q4_K_M/Q5_K_M/Q6_K/Q8_0 + IQ4_NL/IQ4_XS, SafeTensors NVFP4 (prequant), MXFP4. NVFP4 KV cache (--kv-nvfp4) for 4× context compression at decode parity. |
| LoRA | PEFT adapter hot-swap (--lora name=path, per-request "lora" field) — runtime low-rank deltas, no weight patching, works with every quant path. |
| Attention | Prefill: FP16 cuBLAS below the auto fmha_prefill_threshold (the largest chunk whose S-matrix fits, ~2.5k tokens), then the FMHA family above it — an mma.sync m16n8k32 FP8-E4M3 score kernel and a register-resident FlashAttention-2 kernel (head_dim 128). Decode: paged attention (block_size 16) switching on KV dtype (FP16/FP8/INT8/INT4/NVFP4/MXFP4). Auto-dispatch per phase × dtype × layer — see docs/attention-dispatch.md. |
| Architectures | Dense transformers, Mixture-of-Experts (top-k grouped GEMM), Gated DeltaNet (fused recurrent scan), Mamba2 (SSM), SigLIP/Gemma-4v vision encoders. |
sm_120a kernels |
NVFP4 block-scaled mma.sync mxf4nvf4 GEMM/GEMV (CUTLASS v4.5.1), FP8 f8f6f4 attention scores, FA2 block-scaling, packed cvt.e2m1/cvt.e4m3x2 dequant, PDL, Green Contexts. No tcgen05/TMEM/wgmma/TMA-WS — those are datacenter Blackwell only. |
| Server | OpenAI /v1/chat/completions + /v1/completions + /v1/embeddings + /tokenize + /detokenize; Anthropic /v1/messages with real per-token SSE streaming and cache_control prompt caching (prefix-cache pinning + cache_read/cache_creation_input_tokens usage reporting; prefix cache default-on since #538); Prometheus /metrics. Tool/function calling, json_object + json_schema constrained decoding (whole-token validated), reasoning_content separation (DeepSeek format) + think budget. Strict single-model semantics (/v1/models lists only the loaded model; foreign names get a 404, no auto-swap). API-key auth, rate limiting, JSONL request logging. |
| Runtime | CUDA Graphs (auto per model), imp.conf + CLI config, Jinja2 chat templates with macro support. degen_suite.py is the coherence quality-gate after hot-path changes. |
imp is written entirely by Claude Code (Opus 4.6 → 4.8) — 900+ commits, ~97k lines of C++/CUDA in the engine itself (plus ~50k more in tooling and 91 test files / 970+ tests). A human (@kekzl) directs architecture decisions, picks models, and runs benchmarks. The AI agent writes all code: GGUF parser, BPE tokenizer, paged KV cache, CUTLASS NVFP4 GEMM dispatch, FP8/FA2 attention kernels, MoE routing, GDN/Mamba2 scan kernels, HTTP server, CI, the build system — nothing is copy-pasted from other engines.
This is a running experiment in how far AI-generated systems code scales. The architecture doc covers the end-to-end pipeline if you want to see how the pieces fit together.
| Document | |
|---|---|
| Architecture (diagram) | Load → engine init → prefill → decode pipeline |
| Supported models | Tested families with VRAM + tok/s |
| Performance | Throughput numbers + methodology |
| Usage & reference | Build, server, CLI, C API |
| Quantization | GGUF, NVFP4, MXFP4, FP8 KV — formats and trade-offs |
| Attention dispatch | Per-(phase x dtype x layer) kernel selection |
| sm_120a kernels | Kernel optimization notes |
| Determinism | Reproducibility guarantees + known limits |
| imp.conf reference | All runtime configuration keys |
| Roadmap | Open issues and in-flight work |
| Changelog | Per-release notes |
# With CUDA 13.3+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Or via Docker (canonical):
make build # → imp:test image
make verify-fast # build + tests + perf gate (~90s)Full build options and test commands: docs/usage.md. Contributing: CONTRIBUTING.md.
MIT — see LICENSE.
Stands on the shoulders of llama.cpp. The GGUF format, GGML quantization schemes, and most practical conventions for local LLM inference were established there.
Built by @kekzl with Claude Code. Heavy use of CUTLASS v4.5.1 for sm_120a NVFP4/MXFP4 GEMM and grouped MoE kernels; the FMHA/FA2 attention kernels are hand-written mma.sync. Other references: Flash Attention 2, NVIDIA Model Optimizer, llm-compressor.