A fun little terminal tool that turns text files into speech — with a colourful ASCII start screen, an arrow-key file picker, and an optional read-along view that highlights each word as it's spoken (karaoke-style).
Two voice engines:
- Kokoro (default): runs locally and offline, and emits native per-word timestamps, so it can drive the read-along view. Saves `.wav`.
- gTTS: fast, uses Google's servers, no timing data (no read-along). Saves `.mp3`.
- Drop a `.txt`, `.md`, or `.docx` file into the `source/` folder.
- Run the tool. From the menu you can generate speech, read along with a recording, manage (delete) recordings, download all voices for offline use, or change settings, which are split into Models (engine + voice) and Playback (resume where you left off).
- Generating voices the text with your chosen engine and saves the audio to `recordings/` (`.wav` for Kokoro, `.mp3` for gTTS). Each recording remembers the source file it came from, shown in the read-along picker.
- Read-along plays a Kokoro recording back and highlights each word as it's spoken, a page at a time, showing progress and the current heading. Keys: `space` to pause, `↑`/`↓` for speed (live, pitch-corrected), `←`/`→` to seek ±10s, `[`/`]` to jump by block, `+`/`-` for volume, `q` or `Esc` to quit. It can resume from where you left off (toggle in Settings → Playback). Headings, paragraphs, and lists from `.md`/`.docx` are rendered with styling and read with natural pauses. It also has vim-style commands: `/text` / `?text` to search (`n`/`N` to repeat) and `:` for `:toc` (jump to a heading), `:speed 1.5`, `:vol 80`, `:2:30` (jump to time), `:q`, and `:help`. You can also scroll back while it reads: `k`/`j`, `PgUp`/`PgDn`, or the mouse wheel detach into "free look" (audio keeps playing); `f` snaps back to following the cursor.
talkbox pins Python 3.12 (the Kokoro engine's spaCy dependency doesn't build on newer Pythons yet), managed with `uv`. Install `uv` once:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Then set up the project:

```bash
uv python install 3.12
uv venv --python 3.12
# CPU-only torch keeps the Kokoro install ~200MB rather than pulling CUDA wheels
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
uv pip install -r requirements.txt
```

Read-along playback uses mpv, so install it once at system level (it provides libmpv, which python-mpv binds to):

```bash
sudo pacman -S mpv   # or your distro's equivalent / `brew install mpv`
```

talkbox auto-detects libmpv, including Homebrew's `/opt/homebrew/lib` on Apple Silicon. If it's installed somewhere non-standard and read-along can't find it, set the path in Settings → Advanced.
Everything is free and runs locally — no API keys or accounts. Kokoro downloads its model on first use and each voice the first time you pick it; after that it works fully offline. To grab everything ahead of time (e.g. before a trip with no connection), use Download all voices for offline use from the menu. gTTS always needs an internet connection.
```bash
.venv/bin/python main.py
```

Pick Generate, choose a file, and it voices it. With the Kokoro engine you can then pick Read along to watch it read back. You'll also get a ready-to-paste play command:

```bash
ffplay recordings/prp.wav   # any FFmpeg/audio player works
```

Read-along plays audio itself. The `ffplay` hint is just for playing the file outside the app; a player isn't required to generate audio.

Put the bundled `talkbox` launcher on your `PATH` and you can voice + read along with any file in a single command, from any directory:

```bash
ln -s "$PWD/talkbox" ~/.local/bin/talkbox   # once
talkbox ~/notes.md
```

It generates with your saved default engine/voice and drops straight into read-along, reusing an existing recording (and resuming) if you've voiced that file before. If you've since changed your default voice, it asks whether to re-render (and can stop asking for that file). Accepts `.txt`, `.md`, `.docx`, by absolute or relative path.
```
talkbox/
├── main.py            # TUI: start screen, menu, generate / read-along / voices / settings
├── engines.py         # synthesis + per-word timing (gTTS + Kokoro)
├── loaders.py         # read .txt / .md / .docx into plain text
├── player.py          # read-along karaoke playback (mpv)
├── settings.py        # persisted engine + voice (config.json)
├── requirements.txt   # dependencies
├── assets/            # README logo
├── docs/agent/        # design rationale and context
├── source/            # put your .txt/.md/.docx input files here (contents gitignored)
│   └── .gitkeep       # placeholder so the folder ships empty
└── recordings/        # generated audio + metadata sidecars (contents gitignored)
    └── .gitkeep
```
The repo ships with empty `source/` and `recordings/` folders (just a `.gitkeep` each). Your inputs, the generated recordings and their `*.talkbox.json` metadata, and `config.json` are all gitignored, so drop your own files into `source/` to get started.
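Because `config.json` is gitignored, a fresh clone has no settings file, so the loader has to fall back to defaults. A minimal sketch of that pattern, assuming a flat JSON object; the key names in `DEFAULTS` and both helper functions are illustrative and not talkbox's actual schema:

```python
import json
from pathlib import Path

# Assumed keys. Kokoro is the documented default engine; the rest is a guess.
DEFAULTS = {"engine": "kokoro", "voice": None, "resume": True}


def load_settings(path: Path) -> dict:
    """Load settings, falling back to DEFAULTS for a missing or corrupt file."""
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        stored = {}
    if not isinstance(stored, dict):
        stored = {}
    # Stored values override defaults; unknown keys are preserved.
    return {**DEFAULTS, **stored}


def save_settings(path: Path, settings: dict) -> None:
    """Write settings via a temp file so a crash can't leave half a file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename over the old file
```

Merging stored values over defaults means settings added in a later version get a sensible value without any migration step.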
| Package | Why |
|---|---|
| `kokoro` | Local TTS with native per-word timestamps |
| `gtts` | Google Text-to-Speech (fast, no timing) |
| `soundfile` | Write Kokoro audio to `.wav` |
| `python-mpv` | Read-along playback with live, pitch-corrected speed |
| `markdown` + `beautifulsoup4` | Extract clean text from `.md` |
| `python-docx` | Extract text from `.docx` |
| `questionary` | Arrow-key menus and file picker |
| `rich` | Colours, panels, progress bar, karaoke view |
| `pyfiglet` | ASCII-art banner |
- Best viewed in a truecolor terminal (most modern terminals qualify) so the gradient banner and spinner render in full colour.
- Generated recordings (audio + metadata) live in `recordings/` and are gitignored; your `source/` inputs are kept locally too (also gitignored).
This project is licensed under the MIT License - see the LICENSE file for details.
Built with 💻 by Gabriel Guimaraes