An elegant, interactive AI storytelling app powered by Boson AI's Higgs Audio v3 - the state-of-the-art open-weights TTS model with expressive emotions, sound effects, voice cloning, and 100+ language support.
- 🎭 Expressive Emotions - Inject `<|emotion:fear|>`, `<|emotion:awe|>`, `<|emotion:enthusiasm|>` and 19 more inline
- 🔊 Sound Effects - Native `<|sfx:laughter|>`, `<|sfx:sigh|>`, `<|sfx:sneeze|>` and more
- 🎙️ Style Control - Switch between `<|style:whispering|>`, `<|style:shouting|>`, `<|style:singing|>`
- ⚡ Prosody Control - Speed, pitch, pauses: `<|prosody:speed_slow|>`, `<|prosody:pitch_high|>`
- 🌍 100+ Languages - Single-digit WER/CER across 85+ production-quality languages
- 🔁 Zero-Shot Voice Cloning - Supply a reference WAV to replicate any voice
| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| UI Framework | Streamlit ≥ 1.35 |
| Model Server | SGLang-Omni (Docker) |
| HTTP Client | requests |
| Audio I/O | soundfile, numpy |
| Model | Reza2kn/Higgs-Audio-v3-TTS-4bit-AWQ (4-bit AWQ, ~2.5 GB) |
Higgs Audio/
├── app.py ← Core Streamlit application (single file, ~120 lines)
├── requirements.txt ← Python dependencies (pip installable)
├── README.md ← This file
└── output.wav ← Auto-generated on each synthesis run (gitignored)
`app.py` is the single-file Streamlit frontend. It handles:
- Dark Obsidian-themed UI with custom CSS
- 3 premade story templates with Higgs v3 inline tags
- Manual text area for custom scripts
- Generation parameter controls (temperature, top-K, max tokens)
- HTTP POST to SGLang-Omni server at `http://localhost:8000/v1/audio/speech`
- In-browser WAV playback + one-click download
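The request step in the list above could look roughly like the sketch below. Only the endpoint URL and the three control names (temperature, top-K, max tokens) come from this README. The JSON field names (`input`, `temperature`, `top_k`, `max_tokens`, `response_format`), the default values, the helper names, and the assumption that the server answers with raw WAV bytes are all illustrative, modeled on OpenAI-style speech endpoints, so check them against the server you actually run.

```python
SPEECH_URL = "http://localhost:8000/v1/audio/speech"  # endpoint named in this README


def build_speech_payload(text, temperature=0.7, top_k=50, max_tokens=1024):
    """Assemble the JSON body. Field names and defaults are assumptions."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "input": text,                  # story script, including inline tags
        "temperature": float(temperature),
        "top_k": int(top_k),
        "max_tokens": int(max_tokens),
        "response_format": "wav",       # assumed: server returns WAV bytes
    }


def synthesize(text, out_path="output.wav", timeout=300, **params):
    """POST the script to the server and save the returned audio bytes."""
    # Imported lazily so build_speech_payload works without third-party packages.
    import requests

    response = requests.post(
        SPEECH_URL,
        json=build_speech_payload(text, **params),
        timeout=timeout,
    )
    response.raise_for_status()  # surface HTTP errors instead of writing a broken file
    with open(out_path, "wb") as fh:
        fh.write(response.content)
    return out_path
```

Keeping payload construction separate from the network call makes the parameter handling easy to check without a running server.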
`requirements.txt` lists the minimal Python dependencies. The heavy model computation runs inside Docker (SGLang-Omni), so no GPU drivers or ML frameworks are needed in the Python environment.
- Windows 10/11 with PowerShell 5.1+
- Python 3.10+
- Docker Desktop for Windows with GPU support enabled
- NVIDIA GPU with ≥ 4 GB VRAM + NVIDIA Container Toolkit

`pip install -r requirements.txt`

Install the Hugging Face CLI and download the quantized model directly (no account or token required):

`pip install huggingface_hub`

`huggingface-cli download Reza2kn/Higgs-Audio-v3-TTS-4bit-AWQ --local-dir .\model`

The model is ~2.5 GB. It downloads once and is cached in the `model\` folder.
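To confirm the download finished, a quick size check of the model folder can help. The following is a minimal sketch using only the standard library. The function names and the 0.5 GB tolerance are hypothetical, and the size comparison is a rough heuristic based on the ~2.5 GB figure above, not an integrity check.

```python
from pathlib import Path


def folder_size_gb(folder):
    """Return the total size of all files under `folder` in gigabytes (10**9 bytes)."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"model folder not found: {root}")
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / 1e9


def looks_complete(folder, expected_gb=2.5, tolerance_gb=0.5):
    """Rough check: compare the on-disk size with the expected download size."""
    return abs(folder_size_gb(folder) - expected_gb) <= tolerance_gb
```

For example, `looks_complete("model")` run from the project folder returns `False` if the download was interrupted early and most of the weights are missing.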
Open a new PowerShell terminal:
`streamlit run app.py`

The app opens automatically at http://localhost:8501
- Select a template from the dropdown (Fantasy, Thriller, or Kids Story)
- Edit or write your own story using Higgs v3 inline tags
- Adjust temperature, top-K, and max tokens as desired
- Click 🎬 Generate Story Audio
- Listen in the browser player or download the WAV file
Emotions: <|emotion:enthusiasm|> <|emotion:fear|> <|emotion:awe|> <|emotion:sadness|>
Style: <|style:whispering|> <|style:shouting|> <|style:singing|>
Sound FX: <|sfx:laughter|>Haha <|sfx:sigh|>Ugh <|sfx:sneeze|>Achoo
Prosody: <|prosody:speed_slow|> <|prosody:pitch_high|> <|prosody:long_pause|>
💡 Tip: Place emotion/style tokens at the start of the text. Place `<|sfx:...|>` and pause tokens inline exactly where they fire.
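The placement rule in the tip can be expressed as a small helper. This is a sketch only: `tag` and `tagged_line` are hypothetical names, the tag values are taken from the cheat sheet above, and the emotion-before-style ordering is an assumption rather than a documented requirement.

```python
def tag(kind, value):
    """Format one inline control token, e.g. tag("sfx", "sigh") gives "<|sfx:sigh|>"."""
    return f"<|{kind}:{value}|>"


def tagged_line(text, emotion=None, style=None):
    """Prefix a line with optional emotion and style tokens, as the tip suggests."""
    prefix = ""
    if emotion:
        prefix += tag("emotion", emotion)
    if style:
        prefix += tag("style", style)
    return prefix + text


# Emotion and style go at the start; the pause and sound effect sit where they occur.
line = tagged_line(
    f"The door creaked open. {tag('prosody', 'long_pause')}{tag('sfx', 'sigh')}Ugh, not again.",
    emotion="fear",
    style="whispering",
)
print(line)
# <|emotion:fear|><|style:whispering|>The door creaked open. <|prosody:long_pause|><|sfx:sigh|>Ugh, not again.
```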
| # | Use Case | Description |
|---|---|---|
| 1 | 🎮 Expressive Video Game NPC Dialogues | Generate dynamic, emotionally reactive NPC speech at runtime. Each dialogue line adapts tone based on in-game state - fear during combat, joy at victory, confusion when lost. |
| 2 | 📚 Immersive Audiobooks | Convert long-form written chapters into multi-voice narrated audio with automatic emotion pacing, dramatic pauses, and sound effects that match the text mood. |
| 3 | 🧘 AI Meditation Guides | Produce calm, slow-paced meditation scripts with whispering style, low pitch, and long pauses for breathing cues - fully customizable per session. |
| 4 | 📞 Real-Time Interactive Voice Response (IVR) | Power voice menus and AI call center responses with warm, natural-sounding speech rather than robotic TTS - switchable per brand persona. |
| 5 | 🧒 Dynamic Kids Storyboards | Create interactive bedtime stories where children hear characters laugh, gasp, sneeze, and sing - with the narration adapting to chosen story paths in real time. |
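As an illustration of the meditation use case, the sketch below turns plain text into a tagged script using only tags listed in this README (`<|style:whispering|>`, `<|prosody:speed_slow|>`, `<|prosody:long_pause|>`). The function name is hypothetical, combining a style tag with a prosody tag at the start is an assumption, and no low-pitch tag is used because this README does not list its name.

```python
import re

PAUSE = "<|prosody:long_pause|>"
PREFIX = "<|style:whispering|><|prosody:speed_slow|>"  # assumed to be combinable


def meditation_script(text):
    """Whisper slowly and insert a long pause between sentences as a breathing cue."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return PREFIX + f" {PAUSE} ".join(sentences)


print(meditation_script("Breathe in. Breathe out."))
# <|style:whispering|><|prosody:speed_slow|>Breathe in. <|prosody:long_pause|> Breathe out.
```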
| # | Idea | Description |
|---|---|---|
| 1 | 🎵 Automated Ambient Background Music Mixdown | Analyse story mood and auto-select/mix royalty-free background music from a library, fading in/out with the narrative arc using pydub. |
| 2 | 📝 Visual Subtitle Tracking | Sync generated audio with word-level timestamps (via Whisper forced-alignment) and render scrolling karaoke-style subtitles in the browser. |
| 3 | 👥 Live Multi-Character Speaker Diarization | Assign distinct Higgs voice clones to named characters. Parse Character: "dialogue" format and stitch per-character audio into a single multi-voice scene render. |
| 4 | ⚡ Local Model Quantization with AWQ | Integrate the Reza2kn/Higgs-Audio-v3-TTS-4bit-AWQ 4-bit model for direct local inference without Docker - targeting ≥ 4GB VRAM consumer GPUs via AutoAWQ. |
| 5 | 🧠 Sentence-Level Emotion Auto-Generation via LLM Layer | Pass story text through a small LLM (e.g., Qwen2.5-3B) to automatically predict and inject the optimal Higgs emotion/style/sfx tags before TTS synthesis - zero manual tagging required. |
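For idea 3, the parsing step for the `Character: "dialogue"` format might start from something like the sketch below. The regular expression and function name are assumptions about how such a script would be laid out (one turn per line, dialogue in double quotes). Voice assignment and audio stitching are not shown.

```python
import re

# Speaker name, a colon, then the dialogue wrapped in double quotes.
LINE_RE = re.compile(r'^\s*(?P<speaker>[^:"]+?)\s*:\s*"(?P<dialogue>.*)"\s*$')


def parse_scene(script):
    """Return (speaker, dialogue) pairs for each line in Character: "dialogue" form."""
    turns = []
    for raw in script.splitlines():
        match = LINE_RE.match(raw)
        if match:  # lines that do not fit the format are skipped
            turns.append((match.group("speaker"), match.group("dialogue")))
    return turns


scene = 'Mira: "<|emotion:fear|>Did you hear that?"\nTobin: "It is only the wind."'
for speaker, dialogue in parse_scene(scene):
    print(speaker, "->", dialogue)
```

Each pair could then be synthesized with the voice assigned to that speaker, keeping the list order so the clips can be concatenated back into scene order.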
The underlying model (bosonai/higgs-audio-v3-tts-4b) is released under the Boson Higgs Audio v3 Research and Non-Commercial License.
This application code is MIT licensed.
Commercial deployment of the model requires a separate license from Boson AI.