D3velop-llc/csm-rtx5090

CSM-1B on RTX 5090 — Streaming TTS with CUDA Graph Replay

High-performance inference pipeline for Sesame's CSM-1B text-to-speech model on NVIDIA RTX 5090 (Blackwell, sm_120). It achieves a real-time factor (RTF) of about 0.46x, averaged across 10 test sentences, which means audio is generated roughly 2.2x faster than real time.

The Problem

As of mid-2026, running CSM-1B on an RTX 5090 out of the box gives you:

| Issue | What happens |
| --- | --- |
| PyTorch stable (any version through 2.11) | sm_120 is not compatible: no Blackwell kernels, GPU ops return garbage |
| PyTorch nightly + HF Transformers (naive) | 0.87x RTF: only slightly faster than real time, no torch.compile benefit |
| torch.compile(mode="reduce-overhead") | Crashes: HF's StaticCache.index_copy_() breaks CUDA graph replay |

This repo solves all three. Here's what we measured:

| Configuration | RTF | Notes |
| --- | --- | --- |
| Eager (no compile) | 0.87x | Baseline |
| torch.compile(mode="default") | 0.55x | Kernel fusion only |
| This repo (reduce-overhead + patches) | 0.46x | CUDA graph replay on backbone |
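
RTF here is read as generation time divided by audio duration, so lower is better and its reciprocal is the speedup over real time. The short sketch below converts the three measured values; the dictionary keys are shorthand labels chosen for this example, not names used by the repo.

```python
# RTF values copied from the table above; keys are shorthand labels.
rtf = {"eager": 0.87, "compile_default": 0.55, "this_repo": 0.46}

for name, value in rtf.items():
    speedup = 1.0 / value  # seconds of audio produced per second of compute
    print(f"{name:16s} RTF={value:.2f}x -> {speedup:.2f}x real time")

# Relative gain of the patched pipeline over the eager baseline.
gain = rtf["eager"] / rtf["this_repo"]
print(f"this_repo vs eager: {gain:.2f}x faster")
```

Any RTF below 1.0 keeps up with playback; the gap to 1.0 is the headroom left for streaming jitter and other work on the GPU.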

What's In The Box

| File | Purpose |
| --- | --- |
| csm_pipeline.py | The pipeline: model loading, torch.compile, streaming, metrics |
| patch_transformers.py | Patches HF Transformers for CUDA graph compatibility (4 files) |
| demo_server.py | Web UI: type text, hear it spoken, with optional voice cloning |
| setup.sh | One-command environment setup |
| requirements.txt | Python dependencies (torch installed separately) |

Quick Start

# 1. Prerequisites
#    - NVIDIA RTX 5090 (or any Blackwell GPU with sm_120)
#    - NVIDIA driver 570+ (Blackwell support)
#    - Python 3.12+
#    - Accept the sesame/csm-1b license: https://huggingface.co/sesame/csm-1b
#    - Log in to Hugging Face (run outside the venv or use a token):
#      huggingface-cli login
#      # OR: python -c "from huggingface_hub import login; login()"

# 2. Setup
chmod +x setup.sh
./setup.sh

# 3. Set the ptxas path to the value printed by setup.sh (the path below is an example)
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# 4. Run
source .venv/bin/activate
python csm_pipeline.py          # Generate + save output.wav
python demo_server.py            # Web UI on http://localhost:8080

Voice Cloning

CSM generates a different voice each time unless you provide reference audio as context:

# Web UI with voice reference
python demo_server.py --ref voice.wav --ref-text "What was said in the clip"

# Python API
from csm_pipeline import CSMStreamingPipeline, ConversationContext

pipe = CSMStreamingPipeline.from_pretrained()
pipe.warmup()

ctx = ConversationContext()
ctx.add_turn_from_file(0, "What was said in the clip", "voice.wav")

for chunk in pipe.stream("New text in that voice", context=ctx.turns):
    play(chunk)  # play() is a placeholder for your own audio sink; chunks are float32 at 24 kHz

How The Optimizations Work

CSM generates 32 codebook tokens per audio frame (80ms). Each frame requires:

  • 1 backbone forward pass (1B params)
  • 31 depth decoder forward passes (~313M params)

That's 32 forward passes per 80 ms of audio. Without compilation, each forward pass dispatches hundreds of individual GPU kernels, with Python overhead between them.
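
To make the time pressure concrete, the sketch below works out how much wall-clock time each frame and each forward pass gets at a given RTF. The per-pass figure is a plain average and is only illustrative, since the backbone pass is much heavier than a depth decoder pass.

```python
FRAME_MS = 80.0            # audio covered by one frame (from the text above)
PASSES_PER_FRAME = 1 + 31  # 1 backbone pass + 31 depth decoder passes


def frame_budget_ms(rtf: float) -> float:
    """Wall-clock time available per frame at a given real-time factor."""
    return FRAME_MS * rtf


for rtf in (1.00, 0.87, 0.46):
    budget = frame_budget_ms(rtf)
    per_pass = budget / PASSES_PER_FRAME
    print(f"RTF {rtf:.2f}: {budget:5.1f} ms/frame, "
          f"{per_pass:.2f} ms per forward pass on average")
```

At 0.46x RTF the whole frame has to finish in about 36.8 ms, which is why per-kernel launch overhead matters so much here.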

What torch.compile does

Backbone — compiled with mode="reduce-overhead", fullgraph=True:

  • First call: PyTorch traces the forward, compiles it to Triton kernels, records as a CUDA graph
  • Subsequent calls: single GPU command replays the entire forward — no Python, no kernel dispatch overhead

Depth decoder — compiled with mode="default":

  • Kernel fusion (adjacent ops merged into single kernels), but no CUDA graph replay
  • Can't use reduce-overhead because each depth_decoder.generate() creates a fresh KV cache at new memory addresses, breaking graph replay

Why vanilla HF Transformers can't do this

Three things in HF's code break torch.compile:

| Issue | Location | Fix |
| --- | --- | --- |
| index_copy_() in cache update | cache_utils.py | Replace with slice assignment keys[:, :, pos] = new |
| torch.arange(tensor, tensor) | modeling_csm.py (3 places) | Use arange(int) + tensor pattern |
| Threading lock in output capturing | output_capturing.py | Pre-install hooks before compilation |
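
The arange(int) + tensor pattern from the second row can be sketched as follows. The function and variable names are hypothetical and this is not the actual modeling_csm.py code. The idea is that the range length stays a plain Python int (a static shape), while the offset stays on the device as a tensor whose value can change between calls.

```python
import torch


def positions_from_offset(start: torch.Tensor, length: int) -> torch.Tensor:
    """Return [start, start+1, ..., start+length-1] without tensor bounds in arange.

    `length` is a plain Python int, `start` is a 0-dim tensor whose value
    may differ from call to call.
    """
    return torch.arange(length, device=start.device) + start


start = torch.tensor(5)
pos = positions_from_offset(start, 3)
print(pos)  # tensor([5, 6, 7])
```

Passing tensors as the bounds of torch.arange makes the output shape depend on tensor values, which is the kind of data dependence that tends to get in the way of full-graph compilation.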

Additionally, CUDA graph replay requires:

  • cudagraph_mark_step_begin() before each forward call in the generate loop
  • .clone() on backbone outputs before passing to depth decoder
  • A persistent backbone cache with mark_static_address — created once in eager mode, reset and reused across generate() calls

patch_transformers.py applies all of these automatically.

Why index_copy_ breaks and slice assignment doesn't

# BREAKS: index_copy_ bakes the position into the CUDA graph recording.
# On replay, position has advanced but the graph replays the old one.
self.keys.index_copy_(2, cache_position, key_states)

# WORKS: slice assignment reads cache_position from GPU memory at replay time.
# The graph records "read position from this address, write there."
# Between replays, cache_position's VALUE is updated, so writes go to the right place.
self.keys[:, :, cache_position] = key_states
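
In eager mode the two forms write the same values, so the patch changes how the update is captured, not what it computes. The check below uses made-up shapes and runs on CPU; it says nothing about graph capture itself.

```python
import torch

batch, heads, max_len, dim = 1, 2, 8, 4
key_states = torch.randn(batch, heads, 1, dim)  # keys for one new token
cache_position = torch.tensor([3])              # cache slot to write

keys_a = torch.zeros(batch, heads, max_len, dim)
keys_b = torch.zeros(batch, heads, max_len, dim)

keys_a.index_copy_(2, cache_position, key_states)  # original update
keys_b[:, :, cache_position] = key_states          # patched update

print(torch.equal(keys_a, keys_b))
```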

Why the backbone cache must be persistent

HF's generate() creates a fresh StaticCache each call. StaticCache.lazy_initialization() calls torch._dynamo.mark_static_address() on the cache tensors — but only when NOT inside torch.compile tracing. If the cache is created during warmup compilation, the marking is skipped and CUDA graphs crash on replay.

The fix: create the cache once in eager mode (marking happens), then inject that same cache into every generate() call via a monkey-patched method. The compiled backbone graph always sees tensors at the same marked addresses.

Benchmark

from csm_pipeline import CSMStreamingPipeline
pipe = CSMStreamingPipeline.from_pretrained()
pipe.warmup()
pipe.benchmark("Your test sentence here.", iterations=5)

Typical results on RTX 5090 (32 GB GDDR7):

  [1] RTF=0.456x  TTFA=3138ms  gen=3022ms  audio=6880ms  frames=86
  [2] RTF=0.460x  TTFA=1178ms  gen=1135ms  audio=2560ms  frames=32
  [3] RTF=0.462x  TTFA=1219ms  gen=1171ms  audio=2640ms  frames=33
  [4] RTF=0.458x  TTFA=2164ms  gen=2086ms  audio=4720ms  frames=59
  [5] RTF=0.456x  TTFA=2697ms  gen=2598ms  audio=5920ms  frames=74
  Avg RTF: 0.459x
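
The numbers above can be cross-checked with a few lines of arithmetic. In this output the audio duration is the frame count times 80 ms. The RTF column matches the TTFA column divided by the audio duration to three decimals. That reading of the columns is inferred from the numbers, not stated by the benchmark itself.

```python
FRAME_MS = 80  # one frame of audio, per "How The Optimizations Work"

# (reported RTF, TTFA ms, gen ms, audio ms, frames) copied from the output above
runs = [
    (0.456, 3138, 3022, 6880, 86),
    (0.460, 1178, 1135, 2560, 32),
    (0.462, 1219, 1171, 2640, 33),
    (0.458, 2164, 2086, 4720, 59),
    (0.456, 2697, 2598, 5920, 74),
]

ratios = []
for reported, ttfa_ms, gen_ms, audio_ms, frames in runs:
    assert frames * FRAME_MS == audio_ms       # audio length follows from frames
    ratio = ttfa_ms / audio_ms
    assert abs(ratio - reported) < 0.0005      # agrees to three decimals
    ratios.append(ratio)
    print(f"frames={frames:3d}  TTFA/audio={ratio:.3f}  reported={reported:.3f}")

avg = sum(ratios) / len(ratios)
print(f"average TTFA/audio = {avg:.3f}")
```

The recomputed average lands within rounding of the reported 0.459x, since the millisecond values in the log are themselves rounded.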

VRAM: ~4 GB peak (CSM-1B bf16 + Mimi codec + KV cache).

Compatibility

Tested with:

  • NVIDIA GeForce RTX 5090 (sm_120, driver 590.44.01)
  • PyTorch 2.12.0.dev (nightly, cu128)
  • transformers 5.5.0
  • Python 3.12
  • Ubuntu 24.04

It should work on any Blackwell GPU with sm_120 (RTX 5090, RTX 5080, etc.). It may also work on Hopper (sm_90) with stable PyTorch, since the patches are architecture-independent and only the nightly PyTorch requirement is Blackwell-specific.

Troubleshooting

| Symptom | Fix |
| --- | --- |
| sm_120 is not compatible | Install PyTorch nightly: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128 |
| No module named 'packaging' | pip install packaging |
| No module named 'torchcodec' | pip install torchcodec |
| No module named 'torchaudio' | Install from nightly: pip install --pre torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 |
| Gibberish audio | Wrong PyTorch build: it must include sm_120 kernels. Run python -c "import torch; print(torch.randn(10, device='cuda') * 2)" and check that it prints sensible values without errors |
| sesame/csm-1b 403 Forbidden | Accept the license at https://huggingface.co/sesame/csm-1b and run huggingface-cli login |
| ptxas: sm_120 not defined | export TRITON_PTXAS_PATH=/usr/local/cuda-12.8/bin/ptxas |
| Different voice each time | Pass reference audio via ConversationContext |
| Slow first request | Expected. Run pipe.warmup() at startup, and do a context warmup pass with your reference audio |
| CUDA graphs may not be active | Run patch_transformers.py --check: the patches may not have been applied |

License

Pipeline code: MIT. CSM-1B model weights are subject to Sesame's license.
