GitHub - kekzl/imp: High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). Native NVFP4/GGUF, 270 tok/s decode on Qwen3-Coder-30B MoE. Written entirely by Claude Code.

From-scratch CUDA inference engine for the NVIDIA RTX 5090 (sm_120a).
One architecture, fully exploited: +37–72% faster decode than llama.cpp on dense GGUF, and the
only engine that runs native NVFP4 on consumer Blackwell. ~97k lines, 100% written by Claude Code.

What is imp

A C++20/CUDA inference engine that targets exactly one architecture: the NVIDIA RTX 5090 / RTX PRO 6000 (GB202, sm_120a). The build emits raw sm_120a SASS via direct gencode — no portability layer, no FP16 dequant fallback in the hot path, no wrapper around llama.cpp or vLLM. A compute_120f PTX fallback covers the other Blackwell consumer SKUs (RTX 5080 / 5070 Ti). imp ships its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet + Mamba2 scans, CUDA Graphs, and an OpenAI/Anthropic-compatible HTTP server — ~97k lines of C++/CUDA (plus ~50k lines of tooling and tests), every line generated by an AI coding agent.

Why this exists

The RTX 5090 shipped with native FP4 tensor cores, 32 GB GDDR7 at 1.8 TB/s, and a new ISA (sm_120a) that no existing inference engine fully exploits. llama.cpp targets broad hardware compatibility and runs GGUF through dequant-to-FP16 paths — fast everywhere, but leaving consumer-Blackwell-specific features (NVFP4 block-scaled mma.sync MMA, FP8 f8f6f4 scores) on the table. vLLM targets datacenter Blackwell (B200/B300, sm_100) and gates key backends like FlashInfer MoE on tcgen05 — an opcode family that consumer Blackwell (sm_120) doesn't have, so vLLM's NVFP4 path falls back to Marlin or fails outright on the 5090.

Consumer Blackwell is not a smaller datacenter Blackwell: sm_120a has no tcgen05, no TMEM, no wgmma, and no TMA warp-specialized grouped GEMM. imp is built against that reality — the FP4 path is mma.sync mxf4nvf4 with FlashAttention-2-style block-scaling, not the Hopper/B200 kernel designs.

That leaves a gap: 30B-class MoE models (Qwen3-Coder, Qwen3.6, Gemma-4) fit in 32 GB as NVFP4 prequant and should decode at 150–270 tok/s on this hardware, but neither engine delivers that today. imp exists to close that gap — a single-target engine that treats the RTX 5090 as the only GPU that matters and optimizes accordingly.

NVFP4 prequant (from NVIDIA Model Optimizer / llm-compressor) is the native weight format for Blackwell's FP4 tensor cores. imp loads these SafeTensors directly into its NVFP4 decode cache and CUTLASS v4.5.1 GEMM kernels with no re-quantization (there are no FP4 cuBLASLt kernels on sm_120, so CUTLASS is the primary GEMM path). For GGUF models, imp builds an NVFP4 decode cache that converts Q8_0/Q6_K weights to FP4 at init time, getting the bandwidth benefit of sub-byte weights on the decode hot path while keeping full-precision prefill via source dequant.

Performance

Single RTX 5090, greedy, CUDA 13.3, CUDA Graphs on. Headline numbers (decode, the reliable A/B signal — see BENCHMARKS.md for dated, commit-anchored measurements with exact commands):

GGUF dense decode: Qwen3-8B Q8_0 at ~268 tok/s (CI-gated baseline) — +37–72% over llama.cpp with full offload and flash attention.
NVFP4 SafeTensors decode: 30B-class MoE at ~245–307 tok/s (Qwen3-30B/Coder-30B 307, Qwen3.6-35B 245, Gemma-4-26B 259) — effectively uncontested on sm_120, where vLLM's NVFP4 path needs tcgen05 and llama.cpp has no native NVFP4 support.
Honest losses: NVFP4 long-context prefill (~1.7× behind vLLM at pp4096, attention-bound; imp WINS below ~2k context), Qwen3.6-35B GGUF decode (−31%, structural FP16 GDN-projection tax).

Every number, with date, commit SHA, CUDA version, quant and the exact command: BENCHMARKS.md. Methodology details: docs/performance.md.

Should I use this?

Use llama.cpp if you want a stable, mature engine with broad model/hardware support and CPU fallback. It runs everywhere and supports everything.

Use vLLM if you serve high-concurrency batched workloads on datacenter GPUs (H100/B200). It has continuous batching, PagedAttention at scale, and production-grade serving.

Use imp if you have Blackwell consumer/workstation hardware (RTX 5090, RTX PRO 6000), want native NVFP4/FP4 without dequant overhead, and don't mind running a prototype. imp is fastest-on-Blackwell for single-user decode, but it's experimental, single-GPU, and single-author.

Known limitations

Single GPU only. No tensor parallelism, no multi-GPU.
Consumer Blackwell only. sm_120a SASS + compute_120f PTX fallback. No Hopper, Ada, Ampere, datacenter Blackwell. No AMD, Intel, Apple, or CPU paths.
GGUF prefill: largely fixed 2026-06-07. The INT8-IMMA prefill family (#612–#617) puts Qwen3-30B-A3B and Qwen3-14B-Q6_K AHEAD of llama.cpp; Q8_0 dense and gemma-4 sit at 1.20×, Qwen3.6-35B at 1.55× (GDN share is quality-locked). NVFP4 prefill trails vLLM ~1.7× at pp4096 only — imp WINS below ~2k context (see docs/audit/prefill_gap_2026_06_07.md).
MoE/hybrid GGUF decode loses on Qwen3.6-35B (~−31% vs llama.cpp): an FP16-projection tax on the GDN/attention path that NVFP4 can't address.
Only tested models work reliably. Anything not on the supported list may load but hasn't been verified.
Prefill numbers are noisy. cuBLAS autotuning causes up to 2.6× variance across container restarts.
Single-author, experimental. Don't deploy this anywhere it matters.

Quickstart

Everything runs in Docker — no local CUDA toolkit needed. Prebuilt images are on GHCR (built per release for x86-64 + sm_120a):

# Drop a GGUF or SafeTensors model into ./models/
mkdir -p models

# Run the server from the prebuilt image
docker run --gpus all -v ./models:/models -p 8080:8080 \
  ghcr.io/kekzl/imp:latest --model /models/your-model.gguf

# Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'

Or build from source (tracks main instead of the latest release):

git clone https://github.com/kekzl/imp.git && cd imp
docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/your-model.gguf

CLI reference, server flags, config, and C API: docs/usage.md.

Supported models

Family	Variants	Quantizations
Qwen3 / Qwen3-MoE	dense + MoE (Coder-30B-A3B)	Q4_K_M, Q6_K, Q8_0, NVFP4
Qwen3.5 / Qwen3.6	GDN + attention (+ MoE)	Q4_K_M, Q8_0, NVFP4
Gemma-4	26B-A4B MoE, 31B dense	Q4_K_M, Q5_K_M, Q8_0, NVFP4
Gemma-3	text + vision (SigLIP)	GGUF
Phi-4-reasoning-plus	dense, fused projections	NVFP4
gpt-oss-20b	MoE (32 experts, top-4), Harmony	SafeTensors MXFP4 (native)
Nemotron-H	Mamba2 + Attention + MoE	NVFP4, GGUF
Llama / Mistral / DeepSeek	dense + MoE	GGUF (Q*_K, Q8_0)

VRAM, decode tok/s, and per-model notes: docs/supported-models.md.

Features


Quantization	GGUF Q4_K_M/Q5_K_M/Q6_K/Q8_0 + IQ4_NL/IQ4_XS, SafeTensors NVFP4 (prequant), MXFP4. NVFP4 KV cache (`--kv-nvfp4`) for 4× context compression at decode parity.
LoRA	PEFT adapter hot-swap (`--lora name=path`, per-request `"lora"` field) — runtime low-rank deltas, no weight patching, works with every quant path.
Attention	Prefill: FP16 cuBLAS below the auto `fmha_prefill_threshold` (the largest chunk whose S-matrix fits, ~2.5k tokens), then the FMHA family above it — an `mma.sync` `m16n8k32` FP8-E4M3 score kernel and a register-resident FlashAttention-2 kernel (head_dim 128). Decode: paged attention (block_size 16) switching on KV dtype (FP16/FP8/INT8/INT4/NVFP4/MXFP4). Auto-dispatch per phase × dtype × layer — see `docs/attention-dispatch.md`.
Architectures	Dense transformers, Mixture-of-Experts (top-k grouped GEMM), Gated DeltaNet (fused recurrent scan), Mamba2 (SSM), SigLIP/Gemma-4v vision encoders.
`sm_120a` kernels	NVFP4 block-scaled `mma.sync mxf4nvf4` GEMM/GEMV (CUTLASS v4.5.1), FP8 `f8f6f4` attention scores, FA2 block-scaling, packed `cvt.e2m1`/`cvt.e4m3x2` dequant, PDL, Green Contexts. No `tcgen05`/TMEM/`wgmma`/TMA-WS — those are datacenter Blackwell only.
Server	OpenAI `/v1/chat/completions` + `/v1/completions` + `/v1/embeddings` + `/tokenize` + `/detokenize`; Anthropic `/v1/messages` with real per-token SSE streaming and `cache_control` prompt caching (prefix-cache pinning + `cache_read`/`cache_creation_input_tokens` usage reporting; prefix cache default-on since #538); Prometheus `/metrics`. Tool/function calling, `json_object` + `json_schema` constrained decoding (whole-token validated), `reasoning_content` separation (DeepSeek format) + think budget. Strict single-model semantics (`/v1/models` lists only the loaded model; foreign names get a 404, no auto-swap). API-key auth, rate limiting, JSONL request logging.
Runtime	CUDA Graphs (auto per model), `imp.conf` + CLI config, Jinja2 chat templates with macro support. `degen_suite.py` is the coherence quality-gate after hot-path changes.

How was this built?

imp is written entirely by Claude Code (Opus 4.6 → 4.8) — 900+ commits, ~97k lines of C++/CUDA in the engine itself (plus ~50k more in tooling and 91 test files / 970+ tests). A human (@kekzl) directs architecture decisions, picks models, and runs benchmarks. The AI agent writes all code: GGUF parser, BPE tokenizer, paged KV cache, CUTLASS NVFP4 GEMM dispatch, FP8/FA2 attention kernels, MoE routing, GDN/Mamba2 scan kernels, HTTP server, CI, the build system — nothing is copy-pasted from other engines.

This is a running experiment in how far AI-generated systems code scales. The architecture doc covers the end-to-end pipeline if you want to see how the pieces fit together.

Documentation

Document
Architecture (diagram)	Load → engine init → prefill → decode pipeline
Supported models	Tested families with VRAM + tok/s
Performance	Throughput numbers + methodology
Usage & reference	Build, server, CLI, C API
Quantization	GGUF, NVFP4, MXFP4, FP8 KV — formats and trade-offs
Attention dispatch	Per-(phase x dtype x layer) kernel selection
sm_120a kernels	Kernel optimization notes
Determinism	Reproducibility guarantees + known limits
imp.conf reference	All runtime configuration keys
Roadmap	Open issues and in-flight work
Changelog	Per-release notes

Building from source

# With CUDA 13.3+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Or via Docker (canonical):
make build          # → imp:test image
make verify-fast    # build + tests + perf gate (~90s)

Full build options and test commands: docs/usage.md. Contributing: CONTRIBUTING.md.

License

MIT — see LICENSE.

Acknowledgements

Stands on the shoulders of llama.cpp. The GGUF format, GGML quantization schemes, and most practical conventions for local LLM inference were established there.

Built by @kekzl with Claude Code. Heavy use of CUTLASS v4.5.1 for sm_120a NVFP4/MXFP4 GEMM and grouped MoE kernels; the FMHA/FA2 attention kernels are hand-written mma.sync. Other references: Flash Attention 2, NVIDIA Model Optimizer, llm-compressor.

Name		Name	Last commit message	Last commit date
Latest commit History 997 Commits
.claude/skills		.claude/skills
.github		.github
bench		bench
cmake		cmake
docs		docs
include/imp		include/imp
monitoring		monitoring
scripts		scripts
src		src
tests		tests
third_party/stb		third_party/stb
tools		tools
.clang-format		.clang-format
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.env.example		.env.example
.gitignore		.gitignore
BENCHMARKS.md		BENCHMARKS.md
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
DISPATCH.md		DISPATCH.md
Dockerfile		Dockerfile
GOAL.md		GOAL.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml
docker-entrypoint.sh		docker-entrypoint.sh
imp		imp
imp.conf.example		imp.conf.example

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is imp

Why this exists

Performance

Should I use this?

Known limitations

Quickstart

Supported models

Features

How was this built?

Documentation

Building from source

License

Acknowledgements

About

Uh oh!

Releases 9

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

What is imp

Why this exists

Performance

Should I use this?

Known limitations

Quickstart

Supported models

Features

How was this built?

Documentation

Building from source

License

Acknowledgements

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 9

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages