Skip to content

kekzl/imp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

997 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

imp

From-scratch CUDA inference engine for the NVIDIA RTX 5090 (sm_120a).
One architecture, fully exploited: +37–72% faster decode than llama.cpp on dense GGUF, and the
only engine that runs native NVFP4 on consumer Blackwell. ~97k lines, 100% written by Claude Code.

License CUDA 13.3 C++20 Status: experimental


What is imp

A C++20/CUDA inference engine that targets exactly one architecture: the NVIDIA RTX 5090 / RTX PRO 6000 (GB202, sm_120a). The build emits raw sm_120a SASS via direct gencode — no portability layer, no FP16 dequant fallback in the hot path, no wrapper around llama.cpp or vLLM. A compute_120f PTX fallback covers the other Blackwell consumer SKUs (RTX 5080 / 5070 Ti). imp ships its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet + Mamba2 scans, CUDA Graphs, and an OpenAI/Anthropic-compatible HTTP server — ~97k lines of C++/CUDA (plus ~50k lines of tooling and tests), every line generated by an AI coding agent.

Why this exists

The RTX 5090 shipped with native FP4 tensor cores, 32 GB GDDR7 at 1.8 TB/s, and a new ISA (sm_120a) that no existing inference engine fully exploits. llama.cpp targets broad hardware compatibility and runs GGUF through dequant-to-FP16 paths — fast everywhere, but leaving consumer-Blackwell-specific features (NVFP4 block-scaled mma.sync MMA, FP8 f8f6f4 scores) on the table. vLLM targets datacenter Blackwell (B200/B300, sm_100) and gates key backends like FlashInfer MoE on tcgen05 — an opcode family that consumer Blackwell (sm_120) doesn't have, so vLLM's NVFP4 path falls back to Marlin or fails outright on the 5090.

Consumer Blackwell is not a smaller datacenter Blackwell: sm_120a has no tcgen05, no TMEM, no wgmma, and no TMA warp-specialized grouped GEMM. imp is built against that reality — the FP4 path is mma.sync mxf4nvf4 with FlashAttention-2-style block-scaling, not the Hopper/B200 kernel designs.

That leaves a gap: 30B-class MoE models (Qwen3-Coder, Qwen3.6, Gemma-4) fit in 32 GB as NVFP4 prequant and should decode at 150–270 tok/s on this hardware, but neither engine delivers that today. imp exists to close that gap — a single-target engine that treats the RTX 5090 as the only GPU that matters and optimizes accordingly.

NVFP4 prequant (from NVIDIA Model Optimizer / llm-compressor) is the native weight format for Blackwell's FP4 tensor cores. imp loads these SafeTensors directly into its NVFP4 decode cache and CUTLASS v4.5.1 GEMM kernels with no re-quantization (there are no FP4 cuBLASLt kernels on sm_120, so CUTLASS is the primary GEMM path). For GGUF models, imp builds an NVFP4 decode cache that converts Q8_0/Q6_K weights to FP4 at init time, getting the bandwidth benefit of sub-byte weights on the decode hot path while keeping full-precision prefill via source dequant.

Performance

Single RTX 5090, greedy, CUDA 13.3, CUDA Graphs on. Headline numbers (decode, the reliable A/B signal — see BENCHMARKS.md for dated, commit-anchored measurements with exact commands):

  • GGUF dense decode: Qwen3-8B Q8_0 at ~268 tok/s (CI-gated baseline) — +37–72% over llama.cpp with full offload and flash attention.
  • NVFP4 SafeTensors decode: 30B-class MoE at ~245–307 tok/s (Qwen3-30B/Coder-30B 307, Qwen3.6-35B 245, Gemma-4-26B 259) — effectively uncontested on sm_120, where vLLM's NVFP4 path needs tcgen05 and llama.cpp has no native NVFP4 support.
  • Honest losses: NVFP4 long-context prefill (~1.7× behind vLLM at pp4096, attention-bound; imp WINS below ~2k context), Qwen3.6-35B GGUF decode (−31%, structural FP16 GDN-projection tax).

Every number, with date, commit SHA, CUDA version, quant and the exact command: BENCHMARKS.md. Methodology details: docs/performance.md.

Should I use this?

Use llama.cpp if you want a stable, mature engine with broad model/hardware support and CPU fallback. It runs everywhere and supports everything.

Use vLLM if you serve high-concurrency batched workloads on datacenter GPUs (H100/B200). It has continuous batching, PagedAttention at scale, and production-grade serving.

Use imp if you have Blackwell consumer/workstation hardware (RTX 5090, RTX PRO 6000), want native NVFP4/FP4 without dequant overhead, and don't mind running a prototype. imp is fastest-on-Blackwell for single-user decode, but it's experimental, single-GPU, and single-author.

Known limitations

  • Single GPU only. No tensor parallelism, no multi-GPU.
  • Consumer Blackwell only. sm_120a SASS + compute_120f PTX fallback. No Hopper, Ada, Ampere, datacenter Blackwell. No AMD, Intel, Apple, or CPU paths.
  • GGUF prefill: largely fixed 2026-06-07. The INT8-IMMA prefill family (#612–#617) puts Qwen3-30B-A3B and Qwen3-14B-Q6_K AHEAD of llama.cpp; Q8_0 dense and gemma-4 sit at 1.20×, Qwen3.6-35B at 1.55× (GDN share is quality-locked). NVFP4 prefill trails vLLM ~1.7× at pp4096 only — imp WINS below ~2k context (see docs/audit/prefill_gap_2026_06_07.md).
  • MoE/hybrid GGUF decode loses on Qwen3.6-35B (~−31% vs llama.cpp): an FP16-projection tax on the GDN/attention path that NVFP4 can't address.
  • Only tested models work reliably. Anything not on the supported list may load but hasn't been verified.
  • Prefill numbers are noisy. cuBLAS autotuning causes up to 2.6× variance across container restarts.
  • Single-author, experimental. Don't deploy this anywhere it matters.

Quickstart

Everything runs in Docker — no local CUDA toolkit needed. Prebuilt images are on GHCR (built per release for x86-64 + sm_120a):

# Drop a GGUF or SafeTensors model into ./models/
mkdir -p models

# Run the server from the prebuilt image
docker run --gpus all -v ./models:/models -p 8080:8080 \
  ghcr.io/kekzl/imp:latest --model /models/your-model.gguf

# Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'

Or build from source (tracks main instead of the latest release):

git clone https://github.com/kekzl/imp.git && cd imp
docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/your-model.gguf

CLI reference, server flags, config, and C API: docs/usage.md.

Supported models

Family Variants Quantizations
Qwen3 / Qwen3-MoE dense + MoE (Coder-30B-A3B) Q4_K_M, Q6_K, Q8_0, NVFP4
Qwen3.5 / Qwen3.6 GDN + attention (+ MoE) Q4_K_M, Q8_0, NVFP4
Gemma-4 26B-A4B MoE, 31B dense Q4_K_M, Q5_K_M, Q8_0, NVFP4
Gemma-3 text + vision (SigLIP) GGUF
Phi-4-reasoning-plus dense, fused projections NVFP4
gpt-oss-20b MoE (32 experts, top-4), Harmony SafeTensors MXFP4 (native)
Nemotron-H Mamba2 + Attention + MoE NVFP4, GGUF
Llama / Mistral / DeepSeek dense + MoE GGUF (Q*_K, Q8_0)

VRAM, decode tok/s, and per-model notes: docs/supported-models.md.

Features

Quantization GGUF Q4_K_M/Q5_K_M/Q6_K/Q8_0 + IQ4_NL/IQ4_XS, SafeTensors NVFP4 (prequant), MXFP4. NVFP4 KV cache (--kv-nvfp4) for 4× context compression at decode parity.
LoRA PEFT adapter hot-swap (--lora name=path, per-request "lora" field) — runtime low-rank deltas, no weight patching, works with every quant path.
Attention Prefill: FP16 cuBLAS below the auto fmha_prefill_threshold (the largest chunk whose S-matrix fits, ~2.5k tokens), then the FMHA family above it — an mma.sync m16n8k32 FP8-E4M3 score kernel and a register-resident FlashAttention-2 kernel (head_dim 128). Decode: paged attention (block_size 16) switching on KV dtype (FP16/FP8/INT8/INT4/NVFP4/MXFP4). Auto-dispatch per phase × dtype × layer — see docs/attention-dispatch.md.
Architectures Dense transformers, Mixture-of-Experts (top-k grouped GEMM), Gated DeltaNet (fused recurrent scan), Mamba2 (SSM), SigLIP/Gemma-4v vision encoders.
sm_120a kernels NVFP4 block-scaled mma.sync mxf4nvf4 GEMM/GEMV (CUTLASS v4.5.1), FP8 f8f6f4 attention scores, FA2 block-scaling, packed cvt.e2m1/cvt.e4m3x2 dequant, PDL, Green Contexts. No tcgen05/TMEM/wgmma/TMA-WS — those are datacenter Blackwell only.
Server OpenAI /v1/chat/completions + /v1/completions + /v1/embeddings + /tokenize + /detokenize; Anthropic /v1/messages with real per-token SSE streaming and cache_control prompt caching (prefix-cache pinning + cache_read/cache_creation_input_tokens usage reporting; prefix cache default-on since #538); Prometheus /metrics. Tool/function calling, json_object + json_schema constrained decoding (whole-token validated), reasoning_content separation (DeepSeek format) + think budget. Strict single-model semantics (/v1/models lists only the loaded model; foreign names get a 404, no auto-swap). API-key auth, rate limiting, JSONL request logging.
Runtime CUDA Graphs (auto per model), imp.conf + CLI config, Jinja2 chat templates with macro support. degen_suite.py is the coherence quality-gate after hot-path changes.

How was this built?

imp is written entirely by Claude Code (Opus 4.6 → 4.8) — 900+ commits, ~97k lines of C++/CUDA in the engine itself (plus ~50k more in tooling and 91 test files / 970+ tests). A human (@kekzl) directs architecture decisions, picks models, and runs benchmarks. The AI agent writes all code: GGUF parser, BPE tokenizer, paged KV cache, CUTLASS NVFP4 GEMM dispatch, FP8/FA2 attention kernels, MoE routing, GDN/Mamba2 scan kernels, HTTP server, CI, the build system — nothing is copy-pasted from other engines.

This is a running experiment in how far AI-generated systems code scales. The architecture doc covers the end-to-end pipeline if you want to see how the pieces fit together.

Documentation

Document
Architecture (diagram) Load → engine init → prefill → decode pipeline
Supported models Tested families with VRAM + tok/s
Performance Throughput numbers + methodology
Usage & reference Build, server, CLI, C API
Quantization GGUF, NVFP4, MXFP4, FP8 KV — formats and trade-offs
Attention dispatch Per-(phase x dtype x layer) kernel selection
sm_120a kernels Kernel optimization notes
Determinism Reproducibility guarantees + known limits
imp.conf reference All runtime configuration keys
Roadmap Open issues and in-flight work
Changelog Per-release notes

Building from source

# With CUDA 13.3+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Or via Docker (canonical):
make build          # → imp:test image
make verify-fast    # build + tests + perf gate (~90s)

Full build options and test commands: docs/usage.md. Contributing: CONTRIBUTING.md.

License

MIT — see LICENSE.

Acknowledgements

Stands on the shoulders of llama.cpp. The GGUF format, GGML quantization schemes, and most practical conventions for local LLM inference were established there.

Built by @kekzl with Claude Code. Heavy use of CUTLASS v4.5.1 for sm_120a NVFP4/MXFP4 GEMM and grouped MoE kernels; the FMHA/FA2 attention kernels are hand-written mma.sync. Other references: Flash Attention 2, NVIDIA Model Optimizer, llm-compressor.

About

High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). Native NVFP4/GGUF, 270 tok/s decode on Qwen3-Coder-30B MoE. Written entirely by Claude Code.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors