Add Chatterbox text-to-speech backend #2256
Integrate Resemble AI's Chatterbox TTS as a new backend supporting CUDA, ROCm, Metal, and CPU, defaulting to GPU acceleration with CPU fallback. Exposes the existing OpenAI-compatible /v1/audio/speech endpoint with byte-level PCM streaming. Registers English, Multilingual, and Turbo variants, with variant-aware selective downloads to avoid pulling the full multi-gigabyte repo. Bundles are built by the lemonade-sdk/chatterbox-rocm distribution repo.
Scheduled for next release - please do not merge before 10.8 releases.
prompt = body.get("audio_prompt_path")
voice = body.get("voice")
if not prompt and isinstance(voice, str) and os.path.isfile(voice):
GitHub release assets are capped at 2 GiB; frozen torch+CUDA/ROCm bundles exceed that. Enable supports_split_archive and switch the Windows asset to .tar.gz (extracted via native tar) so the split-archive installer path serves all platforms.
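As a rough illustration of why a split archive gets around the 2 GiB cap, the sketch below cuts a file into numbered parts and joins them again. It is not Lemonade's installer logic: the `.partNNN` suffix, the function names, and the headroom below 2 GiB are all assumptions made for the example.

```python
import os
import shutil

# GitHub caps each release asset at 2 GiB; the 1 MiB of headroom is an assumption.
MAX_PART_BYTES = 2 * 1024**3 - 1024**2


def split_archive(path, part_bytes=MAX_PART_BYTES, buf_bytes=1024 * 1024):
    """Split `path` into numbered parts, each at most `part_bytes` long.

    The `.partNNN` suffix is a hypothetical naming scheme for this sketch.
    """
    parts = []
    index = 0
    with open(path, "rb") as src:
        while True:
            part_path = f"{path}.part{index:03d}"
            written = 0
            with open(part_path, "wb") as dst:
                while written < part_bytes:
                    chunk = src.read(min(buf_bytes, part_bytes - written))
                    if not chunk:
                        break
                    dst.write(chunk)
                    written += len(chunk)
            if written == 0:
                os.remove(part_path)  # source exhausted, drop the empty part
                break
            parts.append(part_path)
            index += 1
    return parts


def join_parts(parts, out_path):
    """Concatenate the parts, in order, back into a single archive."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    return out_path
```

The installer side would then download every part, join them in order, and extract the result with tar as usual.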
fl0rianr left a comment
Thanks for the integration — the overall shape looks good, and the latest push addressing split archives is useful. I still think this should not merge yet.
Main blockers:
- CodeQL is flagging an uncontrolled path expression in tools/chatterbox-server/main.py, and I think this is a real issue. The HTTP request can provide audio_prompt_path, and voice is also interpreted as a local host path when os.path.isfile(voice) succeeds. Please avoid accepting arbitrary host paths from API requests. Prefer registered voice IDs, uploaded/temp files, or a strict allowlisted directory with canonical-path validation.
- The advertised platform matrix does not match the generated asset names. system_info.cpp marks Chatterbox CPU as supported on Windows/Linux/macOS for both x86_64 and arm64, but get_install_params() always emits windows-x64, linux-x64, and macos-arm64. Please either constrain the support matrix or generate arch-correct asset names.
- Chatterbox-Multilingual needs explicit handling for a missing language_id. Upstream requires it, while the wrapper only forwards it when present. This should return a clean 400 or use a documented default language.
- Please reconsider the production from_local() -> from_pretrained() fallback. It can hide selective-download bugs and unexpectedly perform network downloads during model load. Lemonade should fail fast if the pulled snapshot is incomplete.
- Please make sure the backend also includes the new watchdog logic, as in #2252.
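To make the first and third blockers concrete, here is a minimal sketch of what the request validation could look like. It assumes voices are registered as `.wav` files in one dedicated directory; `BadRequest`, `resolve_voice_prompt`, `require_language_id`, and the model-name check are all hypothetical names for illustration, not code from this PR.

```python
import os


class BadRequest(Exception):
    """Hypothetical error type that the HTTP layer would map to a 400 response."""


def resolve_voice_prompt(voice_id, voices_dir):
    """Map a registered voice ID to a file inside voices_dir.

    Assumes voices are stored as <voice_id>.wav; anything that resolves
    outside voices_dir is rejected instead of being opened.
    """
    if not isinstance(voice_id, str) or not voice_id:
        raise BadRequest("voice must be a non-empty string")
    root = os.path.realpath(voices_dir)
    candidate = os.path.realpath(os.path.join(root, voice_id + ".wav"))
    try:
        # Comparing canonical paths rejects "..", absolute paths, and symlink escapes.
        inside = os.path.commonpath([root, candidate]) == root
    except ValueError:
        inside = False  # e.g. different drives on Windows
    if not inside or not os.path.isfile(candidate):
        raise BadRequest("unknown voice")
    return candidate


def require_language_id(body, model_name):
    """Return language_id, failing cleanly when a multilingual model lacks one.

    Detecting the variant by substring is an assumption made for this sketch.
    """
    language_id = body.get("language_id")
    if "multilingual" in model_name.lower() and not language_id:
        raise BadRequest("language_id is required for multilingual models")
    return language_id
```

With this shape the request never supplies a host path directly: it names a voice, and the server decides which file, if any, that name maps to.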
Integrates Resemble AI's Chatterbox as a text-to-speech backend, served through the existing OpenAI-compatible /v1/audio/speech endpoint.

Highlights
- [...] generate_stream when available, with a single-chunk fallback otherwise.
- [...] server_models.json).

Pieces
- tools/chatterbox-server/main.py: thin OpenAI-compatible HTTP wrapper around chatterbox-tts.
- ChatterboxServer C++ backend (WrappedServer + ITextToSpeechServer) with per-device install params.
- backend_versions.json pins + docs page + minor UI display name/order.

Self-contained bundles are built by the companion lemonade-sdk/chatterbox-rocm distribution repo (tracks chatterbox-tts PyPI releases).