A local, privacy-first voice dictation app for Windows — similar to Wispr Flow. Hold a hotkey, speak, release — your words are transcribed by faster-whisper and optionally cleaned up by a local LLM, then pasted into whatever window is active.
All processing is 100% local. No internet required after setup.
- Hold-to-speak with configurable hotkey (default: Space)
- Silero VAD for automatic speech segmentation
- faster-whisper (tiny → large-v3) with CUDA / CPU fallback
- Optional LLM cleanup via llama-cpp-python + any GGUF model
- Minimal always-on-top overlay with live waveform
- Full settings GUI — no config file editing needed
- Windows 10 or 11
- Python 3.11+
- A microphone
- (Optional) NVIDIA GPU with CUDA 11.8+ for fast inference
git clone https://github.com/your-username/local-whisper.git
cd local-whisper
python -m venv .venv
.venv\Scripts\activateChoose the right command for your hardware at https://pytorch.org/get-started/locally/.
With CUDA 12.1 (recommended for NVIDIA GPUs):
pip install torch --index-url https://download.pytorch.org/whl/cu121CPU only:
pip install torch --index-url https://download.pytorch.org/whl/cpupip install -r requirements.txtThe default llama-cpp-python wheel is CPU-only. For GPU inference install the
pre-built CUDA wheel (replace cu121 with your CUDA version):
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121Or build from source (requires CMake and Visual Studio Build Tools):
set CMAKE_ARGS=-DGGML_CUDA=on
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dirIf you skip this step the LLM cleanup step runs on CPU (slower but functional).
-
Create a free account at https://huggingface.co
-
Accept the Gemma licence at https://huggingface.co/google/gemma-2-2b-it
-
Download the quantised GGUF from the community repo: https://huggingface.co/bartowski/gemma-2-2b-it-GGUF
The file you want:
gemma-2-2b-it-Q4_K_M.gguf(~1.6 GB) -
Place it in the
models/directory:models/ └── gemma-2-2b-q4_k_m.gguf
You can use any GGUF model — set the path in Settings → LLM Cleanup.
python main.pyA small overlay appears in the bottom-right corner of your screen.
Note: The
keyboardlibrary captures global hotkeys and may require running the terminal as Administrator if certain keys don't register. Right-clicking the overlay → Quit to exit cleanly.
| Action | Result |
|---|---|
| Hold configured hotkey | Start recording; overlay shows 🔴 Listening… |
| Release hotkey | Stop recording; transcription begins |
| VAD detects silence (≥ 0.8 s) | Auto-triggers transcription even while holding |
| Transcription complete | Text is pasted into the active window |
| Right-click overlay | Open Settings or Quit |
| Drag overlay | Reposition it anywhere on screen |
Open Settings (right-click the overlay) → General tab → click the hotkey button and press the key you want to use.
Recommended keys to avoid conflicts with normal typing:
- F2 or F3 — easy to reach, rarely used by apps
- Right Ctrl — comfortable to hold
- Caps Lock — repurpose a rarely-used key
Settings are stored in config.json (auto-created on first run).
All options are accessible through the Settings GUI.
{
"hotkey": "space",
"whisper_model": "small",
"language": "auto",
"llm_enabled": true,
"llm_model_path": "models/gemma-2-2b-q4_k_m.gguf",
"gpu_layers": -1,
"vad_sensitivity": 0.5,
"insertion_method": "clipboard",
"autostart": false,
"unload_timeout_minutes": 5
}| Key | Values | Description |
|---|---|---|
hotkey |
keyboard key name | Key to hold for recording |
whisper_model |
tiny base small medium large-v2 large-v3 |
Whisper model size |
language |
auto or language code |
Transcription language |
llm_enabled |
true / false |
Enable LLM cleanup step |
llm_model_path |
file path | Path to GGUF model file |
gpu_layers |
-1 = all, 0 = CPU |
GPU layers for LLM |
vad_sensitivity |
0.0–1.0 |
Higher = less sensitive |
insertion_method |
clipboard / typewrite |
How text is pasted |
unload_timeout_minutes |
integer | Idle timeout before model unload |
| Hardware | Whisper small |
Gemma 2B Q4 |
|---|---|---|
| GTX 1650 Super, 16 GB RAM | ~0.8 s | ~1.2 s |
| CPU only (Ryzen 5) | ~3–6 s | ~5–10 s |
End-to-end latency target: < 3 s on GPU.
Models are loaded on first use and automatically unloaded after the configured idle timeout to keep background RAM under 200 MB.
Hotkey doesn't work — try running the terminal as Administrator.
"Cannot open microphone" — check that no other app has exclusive access to the mic, and that the correct device is selected in Settings → Audio.
LLM cleanup disabled / model not found — place the GGUF file at the path shown in Settings → LLM Cleanup, or browse to it.
Text not pasting — some apps (e.g. certain games, admin windows) block
Ctrl+V. Switch to Typewrite in Settings → Advanced.
CUDA not detected — ensure the CUDA-enabled PyTorch build is installed
(see step 2). The app falls back to CPU silently; check app.log for details.
Logs are written to app.log in the working directory.
LocalWhisper/
├── main.py # Entry point + AppController
├── config.json # User settings (auto-created)
├── app.log # Runtime log (auto-created)
├── core/
│ ├── audio.py # Mic capture + Silero VAD
│ ├── transcriber.py # faster-whisper wrapper
│ ├── llm.py # llama-cpp-python wrapper
│ └── inserter.py # Text insertion (clipboard / typewrite)
├── ui/
│ ├── overlay.py # Floating status widget
│ └── settings.py # Settings dialog
├── models/ # Place GGUF files here
└── requirements.txt
MIT