CSM-1B on RTX 5090 — Streaming TTS with CUDA Graph Replay

High-performance inference pipeline for Sesame's CSM-1B text-to-speech model on NVIDIA RTX 5090 (Blackwell sm_120). Achieves a real-time factor (RTF: processing time divided by audio duration, so lower is faster) of ~0.46x, averaged across 10 test sentences. That is audio generated ~2x faster than real-time.

The Problem

As of mid-2026, running CSM-1B on an RTX 5090 out of the box gives you:

Issue	What Happens
PyTorch stable (any version through 2.11)	`sm_120 is not compatible`: no Blackwell kernels, GPU ops return garbage
PyTorch nightly + unmodified HF Transformers	0.87x RTF: only marginally faster than real-time, no `torch.compile` benefit
`torch.compile(mode="reduce-overhead")`	Crashes: the `index_copy_()` call in HF's `StaticCache` update breaks CUDA graph replay

This repo solves all three. Here's what we measured:

Configuration	RTF	Notes
Eager (no compile)	0.87x	Baseline
`torch.compile(mode="default")`	0.55x	Kernel fusion only
This repo (reduce-overhead + patches)	0.46x	CUDA graph replay on backbone

What's In The Box

File	Purpose
`csm_pipeline.py`	The pipeline — model loading, torch.compile, streaming, metrics
`patch_transformers.py`	Patches HF Transformers for CUDA graph compatibility (4 files)
`demo_server.py`	Web UI — type text, hear it spoken, with optional voice cloning
`setup.sh`	One-command environment setup
`requirements.txt`	Python dependencies (torch installed separately)

Quick Start

# 1. Prerequisites
#    - NVIDIA RTX 5090 (or any Blackwell GPU with sm_120)
#    - NVIDIA driver 565+ (Blackwell support)
#    - Python 3.12+
#    - Accept the sesame/csm-1b license: https://huggingface.co/sesame/csm-1b
#    - Log in to Hugging Face (run outside venv or use token):
#      huggingface-cli login
#      # OR: python -c "from huggingface_hub import login; login()"

# 2. Setup
chmod +x setup.sh
./setup.sh

# 3. Set ptxas path (use the path printed by setup.sh; the one below is an example)
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# 4. Run
source .venv/bin/activate
python csm_pipeline.py          # Generate + save output.wav
python demo_server.py            # Web UI on http://localhost:8080

Voice Cloning

CSM generates a different voice each time unless you provide reference audio as context:

# Web UI with voice reference
python demo_server.py --ref voice.wav --ref-text "What was said in the clip"

# Python API
from csm_pipeline import CSMStreamingPipeline, ConversationContext

pipe = CSMStreamingPipeline.from_pretrained()
pipe.warmup()

ctx = ConversationContext()
ctx.add_turn_from_file(0, "What was said in the clip", "voice.wav")

for chunk in pipe.stream("New text in that voice", context=ctx.turns):
    play(chunk)  # play() is a placeholder for your audio sink; chunks are float32 @ 24kHz
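The `play(chunk)` call is left to the caller. As one possibility, the sketch below collects streamed chunks and writes them to a 16-bit WAV file using the standard library. It assumes each chunk is a 1-D float32 NumPy array in [-1, 1] at 24 kHz (if the pipeline yields torch tensors, convert with `.cpu().numpy()` first). `write_wav` and the synthetic sine-wave chunks are illustrative stand-ins, not part of `csm_pipeline.py`.

```python
import wave

import numpy as np

SAMPLE_RATE = 24_000  # CSM output rate stated above


def write_wav(path, chunks, sample_rate=SAMPLE_RATE):
    """Concatenate float32 chunks and save them as 16-bit mono PCM."""
    audio = np.concatenate([np.asarray(c, dtype=np.float32).reshape(-1) for c in chunks])
    # Clip to [-1, 1], scale to int16, and force little-endian as WAV expects.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for pipe.stream(...): three 80 ms chunks of a 440 Hz tone.
samples_per_chunk = SAMPLE_RATE * 80 // 1000
t = np.arange(samples_per_chunk) / SAMPLE_RATE
fake_chunks = [np.sin(2 * np.pi * 440 * t).astype(np.float32) for _ in range(3)]
n_samples = write_wav("cloned_voice.wav", fake_chunks)
```

In real use, `fake_chunks` would be replaced by the generator returned from `pipe.stream(...)`.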

How The Optimizations Work

CSM generates 32 codebook tokens per audio frame (80ms). Each frame requires:

1 backbone forward pass (1B params)
31 depth decoder forward passes (~313M params)

That's 32 forward passes per 80ms of audio. Without compilation, each forward dispatches hundreds of individual GPU kernels with Python overhead between them.
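To make the per-frame budget concrete, the arithmetic can be worked through directly. The snippet below uses only figures stated in this README (80 ms frames, 32 forward passes per frame, and the ~0.46x RTF); the helper names are illustrative.

```python
FRAME_MS = 80            # audio covered by one frame
PASSES_PER_FRAME = 32    # 1 backbone + 31 depth decoder passes


def passes_per_audio_second(frame_ms=FRAME_MS, passes=PASSES_PER_FRAME):
    """Forward passes needed to produce one second of audio."""
    return passes * 1000 / frame_ms


def budget_per_pass_ms(rtf, frame_ms=FRAME_MS, passes=PASSES_PER_FRAME):
    """Average wall-clock time available per forward pass at a given RTF."""
    return rtf * frame_ms / passes


print(passes_per_audio_second())   # 400.0
print(budget_per_pass_ms(1.0))     # 2.5 ms per pass just to keep up with real time
print(budget_per_pass_ms(0.46))    # roughly 1.15 ms per pass at the RTF reported here
```

With only a millisecond or two available per pass, per-call Python and kernel-launch overhead becomes a meaningful share of the total.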

What `torch.compile` does

Backbone — compiled with mode="reduce-overhead", fullgraph=True:

First call: PyTorch traces the forward, compiles it to Triton kernels, records as a CUDA graph
Subsequent calls: single GPU command replays the entire forward — no Python, no kernel dispatch overhead

Depth decoder — compiled with mode="default":

Kernel fusion (adjacent ops merged into single kernels), but no CUDA graph replay
Can't use reduce-overhead because each depth_decoder.generate() creates a fresh KV cache at new memory addresses, breaking graph replay
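A minimal sketch of that split, using small stand-in modules rather than the real CSM backbone and depth decoder. The `TinyBlock` class and variable names are hypothetical, and `csm_pipeline.py` may wire this differently. `torch.compile` is lazy, so the wrappers are cheap to create and compilation happens on the first forward call.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Stand-in for a transformer sub-model; not the real CSM architecture."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))


backbone = TinyBlock()
depth_decoder = TinyBlock()

# Backbone: CUDA graph capture and replay; fullgraph=True turns graph breaks into errors.
compiled_backbone = torch.compile(backbone, mode="reduce-overhead", fullgraph=True)

# Depth decoder: kernel fusion only, no CUDA graph replay.
compiled_depth_decoder = torch.compile(depth_decoder, mode="default")
```

Passing `fullgraph=True` on the backbone is a useful guard: a graph break would otherwise split the forward into several smaller graphs and quietly reduce the benefit of replay.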

Why vanilla HF Transformers can't do this

Three things in HF's code break torch.compile:

Issue	Location	Fix
`index_copy_()` in cache update	`cache_utils.py`	Replace with slice assignment `keys[:, :, pos] = new`
`torch.arange(tensor, tensor)`	`modeling_csm.py` (3 places)	Use `arange(int) + tensor` pattern
Threading lock in output capturing	`output_capturing.py`	Pre-install hooks before compilation
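The `arange` row is the least obvious of the three. In eager mode the two spellings give the same result, as the sketch below checks. The difference is that the patched form takes its length from a Python int known at trace time and adds the tensor-valued offset afterwards. The variable names are illustrative and not taken from `modeling_csm.py`.

```python
import torch

start = torch.tensor(5)      # tensor-valued position, e.g. read from a cache position
length = 3                   # plain Python int, fixed when the graph is traced

# Original pattern: both bounds are tensors, so the output size depends on tensor data.
data_dependent = torch.arange(start, start + length)

# Patched pattern: static length from an int, then shift by the tensor offset.
static_shape = torch.arange(length) + start

print(data_dependent.tolist())  # [5, 6, 7]
print(static_shape.tolist())    # [5, 6, 7]
```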

Additionally, CUDA graph replay requires:

cudagraph_mark_step_begin() before each forward call in the generate loop
.clone() on backbone outputs before passing to depth decoder
A persistent backbone cache with mark_static_address — created once in eager mode, reset and reused across generate() calls

patch_transformers.py applies all of these automatically.

Why `index_copy_` breaks and slice assignment doesn't

# BREAKS: index_copy_ bakes the position into the CUDA graph recording.
# On replay, position has advanced but the graph replays the old one.
self.keys.index_copy_(2, cache_position, key_states)

# WORKS: slice assignment reads cache_position from GPU memory at replay time.
# The graph records "read position from this address, write there."
# Between replays, cache_position's VALUE is updated, so writes go to the right place.
self.keys[:, :, cache_position] = key_states
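Both forms write the same values in eager mode, so the patch changes how the write is expressed rather than what gets written. The check below uses illustrative shapes (batch 1, 2 heads, 8 positions, head dim 4), not CSM's real cache dimensions.

```python
import torch

batch, heads, max_len, head_dim = 1, 2, 8, 4
cache_position = torch.tensor([3])                      # where the new token goes
key_states = torch.randn(batch, heads, 1, head_dim)     # keys for that token

keys_a = torch.zeros(batch, heads, max_len, head_dim)
keys_b = torch.zeros(batch, heads, max_len, head_dim)

keys_a.index_copy_(2, cache_position, key_states)       # original HF form
keys_b[:, :, cache_position] = key_states               # patched form

print(torch.equal(keys_a, keys_b))  # True
```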

Why the backbone cache must be persistent

HF's generate() creates a fresh StaticCache each call. StaticCache.lazy_initialization() calls torch._dynamo.mark_static_address() on the cache tensors — but only when NOT inside torch.compile tracing. If the cache is created during warmup compilation, the marking is skipped and CUDA graphs crash on replay.

The fix: create the cache once in eager mode (marking happens), then inject that same cache into every generate() call via a monkey-patched method. The compiled backbone graph always sees tensors at the same marked addresses.

Benchmark

from csm_pipeline import CSMStreamingPipeline
pipe = CSMStreamingPipeline.from_pretrained()
pipe.warmup()
pipe.benchmark("Your test sentence here.", iterations=5)

Typical results on RTX 5090 (32 GB GDDR7):

  [1] RTF=0.456x  TTFA=3138ms  gen=3022ms  audio=6880ms  frames=86
  [2] RTF=0.460x  TTFA=1178ms  gen=1135ms  audio=2560ms  frames=32
  [3] RTF=0.462x  TTFA=1219ms  gen=1171ms  audio=2640ms  frames=33
  [4] RTF=0.458x  TTFA=2164ms  gen=2086ms  audio=4720ms  frames=59
  [5] RTF=0.456x  TTFA=2697ms  gen=2598ms  audio=5920ms  frames=74
  Avg RTF: 0.459x

VRAM: ~4 GB peak (CSM-1B bf16 + Mimi codec + KV cache).
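The columns in the benchmark output are related by simple arithmetic: each frame is 80 ms, so `frames × 80` gives the audio duration, and RTF is elapsed time over audio duration. The sketch below checks this against run [2]. Reading the first timing column as the elapsed time is an assumption, made because it reproduces the printed RTF value.

```python
FRAME_MS = 80


def audio_ms(frames, frame_ms=FRAME_MS):
    """Audio duration in milliseconds for a given number of frames."""
    return frames * frame_ms


def rtf(elapsed_ms, audio_duration_ms):
    """Real-time factor: below 1.0 means faster than real time."""
    return elapsed_ms / audio_duration_ms


# Run [2] from the sample output: 32 frames, 1178 ms, 2560 ms of audio.
print(audio_ms(32))                       # 2560
print(round(rtf(1178, audio_ms(32)), 3))  # 0.46
```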

Compatibility

Tested with:

NVIDIA GeForce RTX 5090 (sm_120, driver 590.44.01)
PyTorch 2.12.0.dev (nightly, cu128)
transformers 5.5.0
Python 3.12
Ubuntu 24.04

Should work on any Blackwell GPU with sm_120 (RTX 5090, RTX 5080, etc.). It may also work on Hopper (sm_90) with standard PyTorch stable: the patches are architecture-independent, and only the nightly torch requirement is Blackwell-specific.

Troubleshooting

Symptom	Fix
`sm_120 is not compatible`	Install PyTorch nightly: `pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128`
`No module named 'packaging'`	`pip install packaging`
`No module named 'torchcodec'`	`pip install torchcodec`
`No module named 'torchaudio'`	Install from nightly: `pip install --pre torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128`
Gibberish audio	Wrong PyTorch build; it must include sm_120 kernels. Run `python -c "import torch; print(torch.randn(10, device='cuda') * 2)"` and check that it runs without errors and prints plausible values
`sesame/csm-1b 403 Forbidden`	Accept license at https://huggingface.co/sesame/csm-1b and `huggingface-cli login`
`ptxas: sm_120 not defined`	`export TRITON_PTXAS_PATH=/usr/local/cuda-12.8/bin/ptxas`
Different voice each time	Pass reference audio via `ConversationContext`
Slow first request	Expected: run `pipe.warmup()` at startup, and do a context warmup pass with your reference audio
`CUDA graphs may not be active`	Run `python patch_transformers.py --check`; the patches may not have applied
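Several of the rows above come down to whether the installed PyTorch build actually targets the GPU. A small diagnostic along these lines can help narrow that down. `describe_cuda_build` is an illustrative helper, not a script shipped in this repo, and it only uses standard `torch.cuda` queries.

```python
import torch


def describe_cuda_build():
    """Collect the facts needed to tell a Blackwell-capable build from one that is not."""
    info = {
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "device_capability": None,
        "compiled_archs": [],
    }
    if info["cuda_available"]:
        # (12, 0) corresponds to sm_120.
        info["device_capability"] = torch.cuda.get_device_capability(0)
        # Architectures this torch build ships kernels for, e.g. "sm_120".
        info["compiled_archs"] = torch.cuda.get_arch_list()
    return info


report = describe_cuda_build()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `device_capability` reports `(12, 0)` but `sm_120` is missing from `compiled_archs`, the build does not include kernels compiled for the GPU, which matches the `sm_120 is not compatible` symptom.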

License

Pipeline code: MIT. CSM-1B model weights are subject to Sesame's license.
