Add Chatterbox text-to-speech backend #2256

Open
Geramy wants to merge 2 commits into main from geramy/chatterbox-implementation

@Geramy Geramy commented Jun 15, 2026

Member

Integrates Resemble AI's Chatterbox as a text-to-speech backend, served through the existing OpenAI-compatible /v1/audio/speech endpoint.

Highlights

  • Multi-device: CUDA, ROCm, Metal, and CPU — auto-selects GPU when available, falls back to CPU.
  • Byte-level streaming: raw PCM16 @ 24 kHz via Chatterbox's generate_stream when available, with a single-chunk fallback otherwise.
  • Three variants: English, Multilingual, and Turbo (registered in server_models.json).
  • Selective downloads: only fetches the weight files each variant actually loads (~3 GB) instead of the full ~14 GB repo.
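
To illustrate the streaming highlight, here is a minimal sketch of a PCM16 generator with a single-chunk fallback. It is not the code in this PR. It assumes a hypothetical model object whose generate_stream(text) yields sequences of float samples in [-1.0, 1.0] and whose generate(text) returns one such sequence; the real chatterbox-tts calls return tensors and may take other arguments.

```python
import struct
from typing import Iterator, Sequence

SAMPLE_RATE = 24_000  # Hz, the rate stated in the PR description


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm_stream(model, text: str) -> Iterator[bytes]:
    """Yield raw PCM16 chunks, streaming when the model supports it.

    `model` is a stand-in (assumption): it only needs `generate(text)` and,
    optionally, `generate_stream(text)`.
    """
    stream_fn = getattr(model, "generate_stream", None)
    if callable(stream_fn):
        # Streaming path: emit each chunk as soon as it is produced.
        for chunk in stream_fn(text):
            yield floats_to_pcm16(chunk)
    else:
        # Fallback path: synthesize everything, then emit a single chunk.
        yield floats_to_pcm16(model.generate(text))
```

Because each sample is converted independently, concatenating the streamed chunks gives the same bytes as the single-chunk fallback for the same samples.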

Pieces

  • tools/chatterbox-server/main.py — thin OpenAI-compatible HTTP wrapper around chatterbox-tts.
  • ChatterboxServer C++ backend (WrappedServer + ITextToSpeechServer) with per-device install params.
  • Recipe wiring: router, backend_utils, system_info (preference order), model_types, recipe_options, runtime_config, model_manager.
  • backend_versions.json pins + docs page + minor UI display name/order.

Self-contained bundles are built by the companion lemonade-sdk/chatterbox-rocm distribution repo (tracks chatterbox-tts PyPI releases).

Integrate Resemble AI's Chatterbox TTS as a new backend supporting CUDA,
ROCm, Metal, and CPU, defaulting to GPU acceleration with CPU fallback.
Exposes the existing OpenAI-compatible /v1/audio/speech endpoint with
byte-level PCM streaming. Registers English, Multilingual, and Turbo
variants, with variant-aware selective downloads to avoid pulling the
full multi-gigabyte repo.
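
As a rough sketch of what variant-aware selection can look like (not the logic in this PR), the snippet below filters a repository file listing with glob patterns per variant. The pattern table and file names are hypothetical and chosen only for illustration. If downloads go through huggingface_hub, its snapshot_download function accepts an allow_patterns argument that applies the same kind of filter.

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical patterns: the real per-variant weight file names are not
# shown in this PR, so these are placeholders.
VARIANT_PATTERNS: Dict[str, List[str]] = {
    "english": ["*_en.safetensors", "tokenizer*.json"],
    "multilingual": ["*_ml.safetensors", "tokenizer*.json"],
    "turbo": ["*_turbo.safetensors", "tokenizer*.json"],
}


def select_files(repo_files: List[str], variant: str) -> List[str]:
    """Return only the files a variant needs, preserving listing order."""
    patterns = VARIANT_PATTERNS[variant]  # KeyError for an unknown variant
    return [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]
```

An unknown variant raises KeyError here on purpose, so a typo in the variant name cannot silently fall back to downloading everything.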

Bundles are built by the lemonade-sdk/chatterbox-rocm distribution repo.
@Geramy Geramy requested review from fl0rianr and jeremyfowers June 15, 2026 20:40
@jeremyfowers jeremyfowers added this to the Lemonade v10.9 milestone Jun 15, 2026
@jeremyfowers

Member

Scheduled for the next release - please do not merge before 10.8 is released.

@github-actions github-actions Bot added the engine::kokoro (Kokoro TTS backend), audio, and enhancement (New feature or request) labels Jun 15, 2026

prompt = body.get("audio_prompt_path")
voice = body.get("voice")
if not prompt and isinstance(voice, str) and os.path.isfile(voice):

GitHub release assets are capped at 2 GiB; frozen torch+CUDA/ROCm bundles
exceed that. Enable supports_split_archive and switch the Windows asset to
.tar.gz (extracted via native tar) so the split-archive installer path
serves all platforms.
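
For readers unfamiliar with the split-archive idea, here is a minimal sketch of splitting a large bundle into parts that each stay under the asset cap, then rejoining them before extraction. It is not the installer's implementation. The .partNNN naming and both function names are assumptions made for this example.

```python
import shutil
from pathlib import Path
from typing import List

GITHUB_ASSET_LIMIT = 2 * 1024**3  # 2 GiB cap mentioned in the commit message


def split_file(src: Path, part_size: int = GITHUB_ASSET_LIMIT - 1,
               block: int = 1024 * 1024) -> List[Path]:
    """Split src into numbered parts of at most part_size bytes each."""
    parts: List[Path] = []
    with src.open("rb") as fin:
        index = 0
        while True:
            # Hypothetical naming scheme: bundle.tar.gz.part000, .part001, ...
            part = src.with_name(f"{src.name}.part{index:03d}")
            written = 0
            with part.open("wb") as fout:
                while written < part_size:
                    data = fin.read(min(block, part_size - written))
                    if not data:
                        break
                    fout.write(data)
                    written += len(data)
            if written == 0:
                part.unlink()  # nothing left to write; drop the empty part
                break
            parts.append(part)
            index += 1
            if written < part_size:
                break  # short part means the source is exhausted
    return parts


def join_parts(parts: List[Path], dest: Path) -> None:
    """Concatenate parts, in order, back into a single archive."""
    with dest.open("wb") as fout:
        for part in parts:
            with part.open("rb") as fin:
                shutil.copyfileobj(fin, fout)
```

Copying in fixed-size blocks keeps memory use flat regardless of bundle size, which matters when each part is close to 2 GiB.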

@fl0rianr fl0rianr left a comment

Collaborator


Thanks for the integration — the overall shape looks good, and the latest push addressing split archives is useful. I still think this should not merge yet.

Main blockers:

  1. CodeQL is flagging an uncontrolled path expression in tools/chatterbox-server/main.py, and I think this is a real issue. The HTTP request can provide audio_prompt_path, and voice is also interpreted as a local host path when os.path.isfile(voice) succeeds. Please avoid accepting arbitrary host paths from API requests. Prefer registered voice IDs, uploaded/temp files, or a strict allowlisted directory with canonical-path validation.

  2. The advertised platform matrix does not match the generated asset names. system_info.cpp marks Chatterbox CPU as supported on Windows/Linux/macOS for both x86_64 and arm64, but get_install_params() always emits windows-x64, linux-x64, and macos-arm64. Please either constrain the support matrix or generate arch-correct asset names.

  3. Chatterbox-Multilingual needs explicit handling for missing language_id. Upstream requires it, while the wrapper only forwards it when present. This should return a clean 400 or use a documented default language.

  4. Please reconsider the production from_local() -> from_pretrained() fallback. It can hide selective-download bugs and unexpectedly perform network downloads during model load. Lemonade should fail fast if the pulled snapshot is incomplete.

  5. Please make sure the backend also includes the new watchdog logic, as in #2252.
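
To make blockers 1 and 3 concrete, here is one possible shape for the checks, offered as a sketch under assumptions rather than a required implementation. It assumes voices are registered as <voice_id>.wav files in a single server-owned directory; resolve_voice, check_language, and the directory layout are hypothetical names. Path.is_relative_to requires Python 3.9 or newer.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_voice(voice_id: str, voices_dir: Path) -> Path:
    """Map a voice ID to a file inside voices_dir.

    Canonicalizes the candidate path and rejects anything that escapes the
    allowlisted directory (absolute paths, "..", symlinks pointing outside).
    """
    root = voices_dir.resolve()
    candidate = (root / f"{voice_id}.wav").resolve()
    if not candidate.is_relative_to(root):
        raise ValueError("voice must name a file inside the voices directory")
    if not candidate.is_file():
        raise ValueError(f"unknown voice: {voice_id!r}")
    return candidate


def check_language(body: dict, multilingual: bool) -> Optional[Tuple[int, str]]:
    """Return (400, message) when a multilingual request lacks language_id.

    Returns None when the request is acceptable.
    """
    if multilingual and not body.get("language_id"):
        return 400, "language_id is required for Chatterbox-Multilingual"
    return None
```

With this shape the request never supplies a host path directly: audio_prompt_path would be dropped from the API, and voice would be treated only as an ID to look up.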

Labels

audio, engine::kokoro (Kokoro TTS backend), enhancement (New feature or request)

4 participants