High-performance inference pipeline for Sesame's CSM-1B text-to-speech model on NVIDIA RTX 5090 (Blackwell sm_120). Achieves a real-time factor (RTF) of ~0.46x, averaged across 10 test sentences, which means audio is generated roughly 2.2x faster than real-time.
As of mid-2026, running CSM-1B on an RTX 5090 out of the box gives you:
| Issue | What Happens |
|---|---|
| PyTorch stable (any version through 2.11) | sm_120 is not compatible — no Blackwell kernels, GPU ops return garbage |
| PyTorch nightly + HF Transformers naive | 0.87x RTF: only marginally faster than real-time, no torch.compile benefit |
| `torch.compile(mode="reduce-overhead")` | Crashes: HF's `StaticCache.index_copy_()` breaks CUDA graph replay |
This repo solves all three. Here's what we measured:
| Configuration | RTF | Notes |
|---|---|---|
| Eager (no compile) | 0.87x | Baseline |
| `torch.compile(mode="default")` | 0.55x | Kernel fusion only |
| This repo (reduce-overhead + patches) | 0.46x | CUDA graph replay on backbone |
| File | Purpose |
|---|---|
| `csm_pipeline.py` | The pipeline: model loading, torch.compile, streaming, metrics |
| `patch_transformers.py` | Patches HF Transformers for CUDA graph compatibility (4 files) |
| `demo_server.py` | Web UI: type text, hear it spoken, with optional voice cloning |
| `setup.sh` | One-command environment setup |
| `requirements.txt` | Python dependencies (torch installed separately) |
# 1. Prerequisites
# - NVIDIA RTX 5090 (or any Blackwell GPU with sm_120)
# - NVIDIA driver 565+ (Blackwell support)
# - Python 3.12+
# - Accept the sesame/csm-1b license: https://huggingface.co/sesame/csm-1b
# - Login to HuggingFace (run outside venv or use token):
# huggingface-cli login
# # OR: python -c "from huggingface_hub import login; login()"
# 2. Setup
chmod +x setup.sh
./setup.sh
# 3. Set ptxas path (printed by setup.sh)
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
# 4. Run
source .venv/bin/activate
python csm_pipeline.py # Generate + save output.wav
python demo_server.py # Web UI on http://localhost:8080

CSM generates a different voice each time unless you provide reference audio as context:
# Web UI with voice reference
python demo_server.py --ref voice.wav --ref-text "What was said in the clip"

# Python API
from csm_pipeline import CSMStreamingPipeline, ConversationContext
pipe = CSMStreamingPipeline.from_pretrained()
pipe.warmup()
ctx = ConversationContext()
ctx.add_turn_from_file(0, "What was said in the clip", "voice.wav")
for chunk in pipe.stream("New text in that voice", context=ctx.turns):
    play(chunk)  # play() is your own audio sink; chunks are float32 @ 24kHz

CSM generates 32 codebook tokens per audio frame (80ms). Each frame requires:
- 1 backbone forward pass (1B params)
- 31 depth decoder forward passes (~313M params)
That's 32 forward passes per 80ms of audio. Without compilation, each forward dispatches hundreds of individual GPU kernels with Python overhead between them.
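The per-frame arithmetic above can be sketched in a few lines. The constants come from the text (80 ms frames, 32 codebooks); the helper name `frame_budget_ms` is illustrative and not part of the repo.

```python
FRAME_MS = 80          # audio covered by one frame
NUM_CODEBOOKS = 32     # tokens generated per frame

# One backbone pass plus one depth-decoder pass for each remaining codebook.
passes_per_frame = 1 + (NUM_CODEBOOKS - 1)
frames_per_audio_second = 1000 / FRAME_MS
passes_per_audio_second = passes_per_frame * frames_per_audio_second


def frame_budget_ms(rtf: float) -> float:
    """Wall-clock time available per frame to hit a given real-time factor."""
    return FRAME_MS * rtf


print(passes_per_frame)                  # 32
print(passes_per_audio_second)           # 400.0
print(round(frame_budget_ms(0.46), 1))   # 36.8
```

At an RTF of 0.46x, all 32 passes for a frame have to finish in about 36.8 ms, a little over 1 ms per pass on average, which is why per-kernel launch overhead matters so much here.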
Backbone — compiled with mode="reduce-overhead", fullgraph=True:
- First call: PyTorch traces the forward, compiles it to Triton kernels, records as a CUDA graph
- Subsequent calls: single GPU command replays the entire forward — no Python, no kernel dispatch overhead
Depth decoder — compiled with mode="default":
- Kernel fusion (adjacent ops merged into single kernels), but no CUDA graph replay
- Can't use `reduce-overhead` because each `depth_decoder.generate()` creates a fresh KV cache at new memory addresses, breaking graph replay
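A minimal sketch of the two compile modes described above, assuming the backbone and depth decoder are available as separate `nn.Module` objects. The `nn.Linear` stand-ins and the helper name `compile_csm_parts` are hypothetical; the actual wiring in `csm_pipeline.py` may differ.

```python
import torch
from torch import nn


def compile_csm_parts(backbone: nn.Module, depth_decoder: nn.Module):
    """Wrap both sub-models with the compile modes described above.

    torch.compile is lazy: nothing is traced or compiled until the first
    forward call on the returned wrappers.
    """
    # Backbone: fused kernels plus CUDA graph capture, no graph breaks allowed.
    compiled_backbone = torch.compile(
        backbone, mode="reduce-overhead", fullgraph=True
    )
    # Depth decoder: kernel fusion only, no CUDA graph replay.
    compiled_depth = torch.compile(depth_decoder, mode="default")
    return compiled_backbone, compiled_depth


# Tiny stand-in modules (assumption) so the sketch can be imported anywhere.
backbone = nn.Linear(8, 8)
depth_decoder = nn.Linear(8, 8)
compiled_backbone, compiled_depth = compile_csm_parts(backbone, depth_decoder)
```

The first forward call on each wrapper is where compilation actually happens, so warmup cost shows up there rather than at wrap time.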
Three things in HF's code break torch.compile:
| Issue | Location | Fix |
|---|---|---|
| `index_copy_()` in cache update | `cache_utils.py` | Replace with slice assignment `keys[:, :, pos] = new` |
| `torch.arange(tensor, tensor)` | `modeling_csm.py` (3 places) | Use `arange(int) + tensor` pattern |
| Threading lock in output capturing | `output_capturing.py` | Pre-install hooks before compilation |
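The first two rewrites in the table are value-preserving, which a small eager-mode check can illustrate. This is a sketch with made-up shapes and variable names, not the patched Transformers code itself.

```python
import torch

# --- arange rewrite -------------------------------------------------------
start = torch.tensor(5)   # position held in a tensor
length = 4                # plain Python int, known at trace time

# Reference result, computed by pulling the tensor value back to Python.
reference = torch.arange(int(start), int(start) + length)
# Rewrite: int-sized arange, shifted by the tensor. No tensor-valued bounds.
rewritten = torch.arange(length) + start
assert torch.equal(reference, rewritten)

# --- cache update rewrite -------------------------------------------------
keys_a = torch.zeros(1, 2, 8, 4)          # (batch, heads, max_len, head_dim)
keys_b = torch.zeros(1, 2, 8, 4)
new_keys = torch.randn(1, 2, 3, 4)        # three new positions
cache_position = torch.tensor([2, 3, 4])

keys_a.index_copy_(2, cache_position, new_keys)   # original form
keys_b[:, :, cache_position] = new_keys           # slice-assignment form
assert torch.equal(keys_a, keys_b)
```

Both pairs produce identical tensors in eager mode; the difference the patches care about is only in how the operations behave under tracing and graph replay.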
Additionally, CUDA graph replay requires:
- `cudagraph_mark_step_begin()` before each forward call in the generate loop
- `.clone()` on backbone outputs before passing to depth decoder
- A persistent backbone cache with `mark_static_address`, created once in eager mode, reset and reused across `generate()` calls
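The first two requirements can be sketched as a simplified frame loop. The function `generate_frames` and its callables are hypothetical stand-ins for the patched generate loop, which lives inside Transformers and is more involved.

```python
import torch


def generate_frames(backbone_step, depth_decode, first_input, num_frames):
    """Hypothetical frame loop: one backbone call, then the depth decoder."""
    frames = []
    x = first_input
    for _ in range(num_frames):
        # Tell the CUDA graph runtime a new iteration is starting. This only
        # matters when CUDA graphs are in use, so skip it on CPU-only machines.
        if torch.cuda.is_available():
            torch.compiler.cudagraph_mark_step_begin()

        hidden = backbone_step(x)
        # A graph-replayed output lives in memory that the next replay
        # overwrites; clone it before anything else holds on to it.
        hidden = hidden.clone()

        codes = depth_decode(hidden)
        frames.append(codes)
        x = codes  # feed the frame back in as the next input
    return frames


# Stand-in callables so the sketch runs anywhere; swap in the compiled models.
frames = generate_frames(lambda x: x * 2, lambda h: h + 1, torch.tensor([1.0]), 3)
print([f.item() for f in frames])  # [3.0, 7.0, 15.0]
```

Without the `.clone()`, a tensor kept from step N would silently change (or raise an error) once step N+1 replays the graph into the same buffer.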
patch_transformers.py applies all of these automatically.
# BREAKS: index_copy_ bakes the position into the CUDA graph recording.
# On replay, position has advanced but the graph replays the old one.
self.keys.index_copy_(2, cache_position, key_states)
# WORKS: slice assignment reads cache_position from GPU memory at replay time.
# The graph records "read position from this address, write there."
# Between replays, cache_position's VALUE is updated, so writes go to the right place.
self.keys[:, :, cache_position] = key_states

HF's `generate()` creates a fresh `StaticCache` each call. `StaticCache.lazy_initialization()` calls `torch._dynamo.mark_static_address()` on the cache tensors, but only when NOT inside torch.compile tracing. If the cache is created during warmup compilation, the marking is skipped and CUDA graphs crash on replay.
The fix: create the cache once in eager mode (marking happens), then inject that same cache into every generate() call via a monkey-patched method. The compiled backbone graph always sees tensors at the same marked addresses.
from csm_pipeline import CSMStreamingPipeline
pipe = CSMStreamingPipeline.from_pretrained()
pipe.warmup()
pipe.benchmark("Your test sentence here.", iterations=5)

Typical results on RTX 5090 (32 GB GDDR7):
[1] RTF=0.456x TTFA=3138ms gen=3022ms audio=6880ms frames=86
[2] RTF=0.460x TTFA=1178ms gen=1135ms audio=2560ms frames=32
[3] RTF=0.462x TTFA=1219ms gen=1171ms audio=2640ms frames=33
[4] RTF=0.458x TTFA=2164ms gen=2086ms audio=4720ms frames=59
[5] RTF=0.456x TTFA=2697ms gen=2598ms audio=5920ms frames=74
Avg RTF: 0.459x
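The logged columns are internally consistent: audio duration equals frames times 80 ms, and each RTF value matches the TTFA column divided by audio duration to three decimals. The sketch below re-derives both from the log lines above. Reading RTF as TTFA divided by audio duration is inferred from these numbers, not taken from the pipeline's source.

```python
import re

LOG = """\
[1] RTF=0.456x TTFA=3138ms gen=3022ms audio=6880ms frames=86
[2] RTF=0.460x TTFA=1178ms gen=1135ms audio=2560ms frames=32
[3] RTF=0.462x TTFA=1219ms gen=1171ms audio=2640ms frames=33
[4] RTF=0.458x TTFA=2164ms gen=2086ms audio=4720ms frames=59
[5] RTF=0.456x TTFA=2697ms gen=2598ms audio=5920ms frames=74
"""

FRAME_MS = 80
LINE = re.compile(
    r"RTF=(?P<rtf>[\d.]+)x TTFA=(?P<ttfa>\d+)ms gen=(?P<gen>\d+)ms "
    r"audio=(?P<audio>\d+)ms frames=(?P<frames>\d+)"
)

rows = [m.groupdict() for m in LINE.finditer(LOG)]
recomputed = []
for row in rows:
    audio_ms = int(row["audio"])
    # Audio duration should be the frame count times 80 ms.
    assert audio_ms == int(row["frames"]) * FRAME_MS
    # RTF below 1.0 means generation finished faster than playback would.
    recomputed.append(int(row["ttfa"]) / audio_ms)

mean_rtf = sum(recomputed) / len(recomputed)
print([round(r, 3) for r in recomputed])
print(f"mean RTF {mean_rtf:.3f}, speedup {1 / mean_rtf:.2f}x")
```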
VRAM: ~4 GB peak (CSM-1B bf16 + Mimi codec + KV cache).
Tested with:
- NVIDIA GeForce RTX 5090 (sm_120, driver 590.44.01)
- PyTorch 2.12.0.dev (nightly, cu128)
- transformers 5.5.0
- Python 3.12
- Ubuntu 24.04
Should work on any Blackwell GPU (RTX 5090, RTX 5080, etc.) with sm_120. May also work on Hopper (sm_90) with standard PyTorch stable — the patches are architecture-independent, only the nightly torch requirement is Blackwell-specific.
| Symptom | Fix |
|---|---|
| `sm_120 is not compatible` | Install PyTorch nightly: `pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128` |
| `No module named 'packaging'` | `pip install packaging` |
| `No module named 'torchcodec'` | `pip install torchcodec` |
| `No module named 'torchaudio'` | Install from nightly: `pip install --pre torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128` |
| Gibberish audio | Wrong PyTorch build; it must include sm_120 kernels. Run `python -c "import torch; torch.randn(10,device='cuda')*2"` to verify |
| `sesame/csm-1b 403 Forbidden` | Accept the license at https://huggingface.co/sesame/csm-1b and run `huggingface-cli login` |
| `ptxas: sm_120 not defined` | `export TRITON_PTXAS_PATH=/usr/local/cuda-12.8/bin/ptxas` |
| Different voice each time | Pass reference audio via `ConversationContext` |
| Slow first request | Expected. Run `pipe.warmup()` at startup, and do a context warmup pass with your reference audio |
| `CUDA graphs may not be active` | Check `patch_transformers.py --check`; the patches may not have applied |
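For the PyTorch-build rows in the table, a slightly fuller version of the one-line check can report what the installed build supports. The helper name `describe_gpu_support` is hypothetical, and the architecture membership test is a heuristic rather than a guarantee that every kernel works.

```python
import torch


def describe_gpu_support() -> dict:
    """Collect basic facts about the installed torch build and the GPU."""
    info = {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        arch = f"sm_{major}{minor}"
        info["device_arch"] = arch
        # Architectures this torch build was compiled for.
        info["compiled_archs"] = torch.cuda.get_arch_list()
        info["arch_in_build"] = arch in info["compiled_archs"]
        # Same smoke test as the table's one-liner, plus a sanity check.
        x = torch.randn(10, device="cuda") * 2
        info["finite_output"] = bool(torch.isfinite(x).all())
    return info


for key, value in describe_gpu_support().items():
    print(f"{key}: {value}")
```

On an RTX 5090 the interesting fields are `device_arch` (expected to be `sm_120`) and whether that string appears in `compiled_archs`.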
Pipeline code: MIT. CSM-1B model weights are subject to Sesame's license.