Qwen3-TTS-API

OpenAI-compatible Text-to-Speech API powered by Qwen3-TTS (Alibaba Qwen Team).

State-of-the-art multilingual TTS with 10 languages, voice cloning, voice design, and instruction-based control. Supports RTX 50-series (Blackwell) GPUs.

Features

OpenAI-compatible /v1/audio/speech endpoint
Voice cloning from a 3-second reference audio clip
10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Multiple model sizes: 0.6B (~3GB VRAM) and 1.7B (~6GB VRAM)
Streaming-ready architecture (97ms first-packet latency)

Quick Start

docker run -d --gpus all \
  -p 8080:8080 \
  -v /mnt/user/appdata/qwen3-tts-api/models:/root/.cache/huggingface \
  -e MODEL_ID=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --shm-size=4g \
  --name qwen3-tts-api \
  ghcr.io/hsiang-han/qwen3-tts-api:latest

Or with docker compose:

docker compose -f docker/gpu/docker-compose.yml up -d

The first start downloads the model files (~3-7GB) from HuggingFace. Users in mainland China can set HF_ENDPOINT=https://hf-mirror.com for faster downloads.
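Because the first start can take a while, it may help to wait until the server responds before sending requests. The following is a minimal sketch, assuming that GET /health returns a 2xx status once the model is loaded (the response body is not inspected). The function name wait_for_tts and its arguments are hypothetical.

```shell
# Poll GET /health until the server answers with a success status.
# Assumption: /health returns 2xx once the API is ready to serve requests.
# Usage: wait_for_tts [BASE_URL] [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_tts() {
  base_url="${1:-http://localhost:8080}"
  timeout_s="${2:-600}"   # generous default: the first start downloads the model
  interval_s="${3:-5}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # -f makes curl return non-zero on HTTP errors, -s keeps polling quiet
    if curl -fs -o /dev/null "$base_url/health"; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "not ready after ${timeout_s}s" >&2
  return 1
}

# Example: wait_for_tts http://localhost:8080 900 10
```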

Usage Examples

# Generate speech with built-in voice
curl -X POST http://localhost:8080/v1/audio/speech \
  -F "input=Hello, this is a test." \
  -F "voice=Vivian" \
  -F "language=English" \
  --output output.wav

# With emotion instruction
curl -X POST http://localhost:8080/v1/audio/speech \
  -F "input=我真的太开心了！" \
  -F "voice=Vivian" \
  -F "language=Chinese" \
  -F "instruct=用特别开心的语气说" \
  --output happy.wav

# List available voices
curl http://localhost:8080/v1/voices

Built-in Voices (CustomVoice model)

Voice	Description	Native Language
Vivian	Bright, slightly edgy young female	Chinese
Serena	Warm, gentle young female	Chinese
Uncle_Fu	Seasoned male, low mellow timbre	Chinese
Dylan	Youthful Beijing male, clear and natural	Chinese (Beijing)
Eric	Lively Chengdu male, slightly husky	Chinese (Sichuan)
Ryan	Dynamic male, strong rhythmic drive	English
Aiden	Sunny American male, clear midrange	English
Ono_Anna	Playful Japanese female, light and nimble	Japanese
Sohee	Warm Korean female, rich emotion	Korean

API Endpoints

Health Check

GET /health

List Models & Voices

GET /v1/models
GET /v1/voices

Text-to-Speech (CustomVoice / VoiceDesign)

POST /v1/audio/speech

Parameter	Type	Default	Description
input	string	required	Text to synthesize
voice	string	Vivian	Speaker name
language	string	Auto	Language (Auto, Chinese, English, Japanese, etc.)
instruct	string	null	Instruction for tone/emotion control
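The parameters above can be wrapped in a small shell function. This is a sketch rather than part of the project: tts_say and the TTS_BASE_URL variable are hypothetical names. It sends text fields with curl's --form-string instead of -F, because -F gives special meaning to a leading @ or < and to a ;type= suffix, which can mangle arbitrary input text.

```shell
# Hypothetical wrapper around POST /v1/audio/speech.
# Usage: tts_say TEXT OUTPUT_FILE [VOICE] [LANGUAGE] [INSTRUCT]
tts_say() {
  text="$1"
  out_file="$2"
  voice="${3:-Vivian}"      # documented default voice
  language="${4:-Auto}"     # documented default language
  instruct="${5:-}"
  base_url="${TTS_BASE_URL:-http://localhost:8080}"

  if [ -z "$text" ] || [ -z "$out_file" ]; then
    echo "usage: tts_say TEXT OUTPUT_FILE [VOICE] [LANGUAGE] [INSTRUCT]" >&2
    return 2
  fi

  # Build the curl arguments in "$@" so values containing spaces stay intact.
  set -- -fsS -X POST "$base_url/v1/audio/speech" \
    --form-string "input=$text" \
    --form-string "voice=$voice" \
    --form-string "language=$language"
  # Only send instruct when one was given; otherwise the server default applies.
  if [ -n "$instruct" ]; then
    set -- "$@" --form-string "instruct=$instruct"
  fi
  curl "$@" --output "$out_file"
}

# Example: tts_say "Hello; this is a test." output.wav Ryan English
```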

Voice Clone (Base model only)

POST /v1/audio/speech/clone

Parameter	Type	Default	Description
input	string	required	Text to synthesize
ref_audio	file	required	Reference audio file (WAV)
ref_text	string	null	Transcript of reference audio (improves quality)
language	string	Auto	Target language
x_vector_only	bool	false	Use speaker embedding only (no ICL)
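A clone request could look like the sketch below. It assumes the endpoint accepts multipart form data like /v1/audio/speech, with ref_audio sent as a file upload. The function name tts_clone and the TTS_BASE_URL variable are hypothetical. The container must be running one of the Base models for this endpoint to work.

```shell
# Hypothetical wrapper around POST /v1/audio/speech/clone (Base models only).
# Usage: tts_clone TEXT REF_AUDIO_WAV OUTPUT_FILE [REF_TEXT] [LANGUAGE]
tts_clone() {
  text="$1"
  ref_audio="$2"
  out_file="$3"
  ref_text="${4:-}"
  language="${5:-Auto}"
  base_url="${TTS_BASE_URL:-http://localhost:8080}"

  if [ ! -f "$ref_audio" ]; then
    echo "reference audio not found: $ref_audio" >&2
    return 1
  fi

  # ref_audio uses -F with @ so curl uploads the file contents;
  # text fields use --form-string so they are sent literally.
  set -- -fsS -X POST "$base_url/v1/audio/speech/clone" \
    --form-string "input=$text" \
    -F "ref_audio=@$ref_audio" \
    --form-string "language=$language"
  # The transcript is optional; send it only when provided.
  if [ -n "$ref_text" ]; then
    set -- "$@" --form-string "ref_text=$ref_text"
  fi
  curl "$@" --output "$out_file"
}

# Example: tts_clone "Nice to meet you." ref.wav cloned.wav "Transcript of ref.wav" English
```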

Environment Variables

Variable	Default	Description
MODEL_ID	Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice	HuggingFace model ID or local path
DTYPE	bfloat16	Model precision (float16, bfloat16, float32)
DEVICE	cuda:0	Device to load model on
ATTN_IMPLEMENTATION	flash_attention_2	Attention backend (flash_attention_2, sdpa, eager)
PORT	8080	API server port
HF_HOME	/root/.cache/huggingface	HuggingFace cache directory
HF_ENDPOINT	https://huggingface.co	HuggingFace mirror (China: https://hf-mirror.com)
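As an illustration of overriding these defaults, the sketch below starts the container with a different model and attention backend. The function name start_qwen3_tts and the MODELS_DIR variable are hypothetical. The image, mount target, and variable names are taken from the Quick Start and the table above.

```shell
# Hypothetical helper: start the container with non-default settings.
# Usage: start_qwen3_tts [MODEL_ID] [ATTN_IMPLEMENTATION] [HOST_PORT]
start_qwen3_tts() {
  model_id="${1:-Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice}"
  attn="${2:-sdpa}"
  host_port="${3:-8080}"
  # MODELS_DIR is an assumed variable for the host-side model cache directory.
  models_dir="${MODELS_DIR:-/mnt/user/appdata/qwen3-tts-api/models}"

  docker run -d --gpus all \
    -p "$host_port:8080" \
    -v "$models_dir:/root/.cache/huggingface" \
    -e "MODEL_ID=$model_id" \
    -e "ATTN_IMPLEMENTATION=$attn" \
    --shm-size=4g \
    --name qwen3-tts-api \
    ghcr.io/hsiang-han/qwen3-tts-api:latest
}

# Example: start_qwen3_tts Qwen/Qwen3-TTS-12Hz-1.7B-Base flash_attention_2 9000
```

Only the host side of the port mapping changes here. The container keeps listening on the default PORT of 8080.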

Available Models

Model ID	Type	VRAM	Features
Qwen/Qwen3-TTS-12Hz-0.6B-Base	Base	~3GB	Voice clone
Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice	CustomVoice	~3GB	9 built-in voices
Qwen/Qwen3-TTS-12Hz-1.7B-Base	Base	~6GB	Voice clone
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice	CustomVoice	~6GB	9 built-in voices + instruction control
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign	VoiceDesign	~6GB	Design voice from text description

Hardware Requirements

NVIDIA GPU with 4GB+ VRAM (0.6B) or 8GB+ VRAM (1.7B)
NVIDIA driver 550+ (Ampere/Ada) or 570+ (Blackwell RTX 50-series)
Docker with NVIDIA Container Toolkit
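
The VRAM requirements above can be turned into a quick check before picking a MODEL_ID. This is a rough sketch: suggest_model_size is a hypothetical helper, and the 4000 and 8000 MiB cutoffs are approximations of the 4GB and 8GB figures, since cards often report slightly less than their nominal size.

```shell
# Map total VRAM (in MiB) to the largest model size the requirements allow.
# Assumption: 4000/8000 MiB approximate the 4GB/8GB thresholds.
suggest_model_size() {
  vram_mib="$1"
  if [ "$vram_mib" -ge 8000 ]; then
    echo "1.7B"
  elif [ "$vram_mib" -ge 4000 ]; then
    echo "0.6B"
  else
    echo "insufficient"
    return 1
  fi
}

# Example (reads the first GPU's total memory in MiB from nvidia-smi):
# suggest_model_size "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
```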
