A local web UI for the Qwen3-TTS model family, built with NiceGUI.
| Tab | Description |
|---|---|
| Custom Voice | Generate speech using one of 9 built-in speaker personas with optional style instructions |
| Voice Design | Describe a voice in plain text and synthesise speech with it |
| Voice Clone | Upload a short reference clip and clone that voice onto new text |
| Batch | Queue multiple Custom Voice items and generate them sequentially with per-item progress |
| Personas | Save and manage named voice presets (speaker + language + instruction) for quick reuse |
Models are loaded on demand and can be unloaded individually to free memory. Custom Voice and Voice Clone support both 0.6B and 1.7B checkpoints; Voice Design currently uses the 1.7B checkpoint only.
Additional runtime behavior:
- Choose the model size before loading when multiple checkpoints are available
- Choose the backend device before loading (`cuda:0`, `mps`, or `cpu`, depending on your machine)
- Loaded tabs show the active runtime as `device / dtype`
- On Apple Silicon, the app retries once in safer MPS `float32` mode if generation fails with a probability-tensor stability error
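The retry-once behavior on Apple Silicon can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `generate_with_mps_retry`, the `generate` and `reload_as_float32` callables, and the error-matching substring are all hypothetical names chosen for this example. The substring assumes the failure surfaces as a `RuntimeError` whose message mentions "probability tensor", which is how PyTorch sampling errors of this kind are typically worded.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed substring of the sampling error message; adjust if the real
# error text differs on your setup.
STABILITY_MARKER = "probability tensor"


def generate_with_mps_retry(
    generate: Callable[[], T],
    reload_as_float32: Callable[[], None],
    device: str,
) -> T:
    """Run generate(); on an MPS stability error, reload in float32 and retry once."""
    try:
        return generate()
    except RuntimeError as err:
        if device != "mps" or STABILITY_MARKER not in str(err):
            raise  # unrelated failure, or not on MPS: do not retry
        reload_as_float32()  # hypothetical hook that reloads the model in float32
        return generate()  # single retry; a second failure propagates to the caller
```

Limiting the retry to one attempt keeps a persistent failure visible instead of looping, at which point switching the device to `cpu` is the remaining option.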
Requirements:
- Python 3.11+
- uv
- A machine with MPS (Apple Silicon), CUDA, or enough RAM for CPU inference
git clone https://github.com/AlapinEnjoyer/qw3n-face.git
cd qw3n-face
uv sync
uv run python main.py

The app should then open automatically at http://localhost:8080.
Models are downloaded automatically from Hugging Face once requested in the app:
| Key | Checkpoint |
|---|---|
| Custom Voice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice or Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice |
| Voice Design | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
| Voice Clone | Qwen/Qwen3-TTS-12Hz-1.7B-Base or Qwen/Qwen3-TTS-12Hz-0.6B-Base |
Checkpoint size depends on the variant: the 0.6B models are substantially smaller than the 1.7B models. After the first load, Hugging Face caches the downloaded files locally.
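As an illustration of the table above, the mapping from tab and model size to a Hugging Face repository can be written as a simple lookup. The names `CHECKPOINTS`, `checkpoint_for`, and `prefetch` are hypothetical and not taken from the app. The optional `prefetch` helper assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function to fill the local cache ahead of time.

```python
# Repository IDs copied from the table above, keyed by (tab, model size).
CHECKPOINTS = {
    ("Custom Voice", "1.7B"): "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("Custom Voice", "0.6B"): "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("Voice Design", "1.7B"): "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ("Voice Clone", "1.7B"): "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    ("Voice Clone", "0.6B"): "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
}


def checkpoint_for(tab: str, size: str) -> str:
    """Return the repository ID for a tab and size, or raise if unavailable."""
    try:
        return CHECKPOINTS[(tab, size)]
    except KeyError:
        raise ValueError(f"no {size} checkpoint for {tab!r}") from None


def prefetch(tab: str, size: str) -> str:
    """Download a checkpoint into the local Hugging Face cache (needs network)."""
    # Imported lazily so the lookup above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=checkpoint_for(tab, size))
```

Calling `prefetch("Voice Clone", "0.6B")` before starting the app would warm the cache, so the first load in the UI does not wait on the download.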
- CUDA uses `bfloat16`
- MPS prefers `float16`, but the app can retry a failing model in `float32` for stability
- CPU prefers `bfloat16` and falls back to `float32` if needed during model load
- If Apple Silicon generation still fails on MPS, switch the backend device to `cpu` before loading the model
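The per-device dtype preferences listed above amount to an ordered list of candidates for each backend. The sketch below encodes them as plain strings so it runs without PyTorch. `DTYPE_PREFERENCES` and `dtype_candidates` are hypothetical names for this example, and the app may organize this logic differently.

```python
# Ordered dtype candidates per backend: first entry is preferred,
# later entries are fallbacks, as described in the list above.
DTYPE_PREFERENCES = {
    "cuda": ["bfloat16"],
    "mps": ["float16", "float32"],
    "cpu": ["bfloat16", "float32"],
}


def dtype_candidates(device: str) -> list[str]:
    """Return dtype names to try, in order, for a device such as 'cuda:0'."""
    backend = device.split(":", 1)[0]  # strip an optional device index
    try:
        return list(DTYPE_PREFERENCES[backend])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"unsupported device: {device!r}") from None
```

With PyTorch installed, each name could be resolved with `getattr(torch, name)` before being passed to the model loader.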
Planned features:
- Add automatic transcription of uploaded audio
- Add audio visualisation (waveform, spectrogram?)
- Add support for fine tuning