A fully offline, privacy-first voice assistant that runs entirely on local hardware. Athena combines a large mixture-of-experts language model (Qwen3.5), neural text-to-speech (Orpheus 3B), real-time speech recognition (Whisper), and a SNAC neural audio codec into a four-process pipeline — all in C++ with zero Python dependencies.
Athena speaks with natural emotion (laughs, sighs, gasps), maintains long conversational context, and runs on a single consumer GPU + system RAM. No cloud, no telemetry, no API keys.
┌──────────────────────┐
│ llama-server │
│ (Orpheus 3B GGUF) │
│ GPU inference │
│ -np 2 (2 slots) │
└──────┬───────────────┘
│ HTTP /completion
│ (all-ahead parallel
│ + 200ms stagger)
┌────────────────────────────────┐ │
│ orpheus-speak (daemon) │◄─────────┘
│ │
│ text → character-based chunks │
│ → concurrent HTTP submission │
│ → SNAC token parse │
│ → SNAC ONNX decode (GPU/CPU) │
│ → double-buffered WAV │
│ → pipelined playback │
│ + PulseAudio drain guard │
└────────────┬───────────────────┘
│ file watch (speak_tts.txt)
│
┌────────────┴───────────────────┐ ┌──────────────────┐
│ speak-daemon.sh │◄───────│ talk-llama │
│ (sanitize markdown/URLs, │ write │ (Whisper STT │
│ atomic trigger file write) │ │ + Qwen3.5 LLM │
└────────────────────────────────┘ │ + MoE CPU offload│
│ + voice I/O) │
└──────────────────┘
Four processes run simultaneously:
-
llama-server — runs Orpheus 3B on GPU with 2 parallel slots. Accepts text prompts via HTTP, outputs SNAC audio tokens via continuous batching.
-
orpheus-speak (daemon) — watches a trigger file. Splits long text into character-limited chunks (≤300 chars), submits all chunks as concurrent HTTP requests with retry-on-503, decodes SNAC tokens via ONNX Runtime, and plays audio through double-buffered WAV files with pipelined playback and PulseAudio drain guard.
-
talk-llama — Whisper speech-to-text + Qwen3.5 chat LLM. Writes assistant responses to a speak file. Supports MoE CPU offloading — expert FFN tensors live in system RAM while attention and shared experts run on GPU.
-
speak-daemon.sh — bridges talk-llama and orpheus-speak. Sanitizes markdown, URLs, and chat template tokens, then atomically writes to the trigger file.
- CMake ≥ 3.18
- C++17 compiler (gcc ≥ 8, clang ≥ 10)
- libcurl development headers
- SDL2 development headers (for microphone capture)
- ONNX Runtime (C/C++ GPU release)
- CUDA toolkit (12.x or 13.x)
- cuDNN 9.x (see installation below)
sudo apt install build-essential cmake git libcurl4-openssl-dev libsdl2-dev moreutilsmoreutils provides the ts timestamp utility used for instrumented logging.
ONNX Runtime GPU requires cuDNN 9.x. The CUDA toolkit does not include cuDNN.
-
Download the Full variant (not JIT) from developer.nvidia.com/cudnn-downloads. Select: Linux → x86_64 → Debian → deb (local) → FULL.
-
Install:
wget https://developer.download.nvidia.com/compute/cudnn/9.20.0/local_installers/cudnn-local-repo-debian12-9.20.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-debian12-9.20.0_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-debian12-9.20.0/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn9-cuda-12 # or cudnn9-cuda-13 for CUDA 13.x- Ensure
/etc/ld.so.confincludes the ONNX Runtime library path, then runsudo ldconfig.
Use Full, not JIT — JIT compiles kernels at runtime, adding latency.
All models can be downloaded via wget or curl.
| Variant | Size | Link |
|---|---|---|
| UD-Q4_K_XL (recommended) | 1.98 GB | Download |
| Q4_K_M | 1.94 GB | Download |
| Q8_0 (best quality) | 4.03 GB | Download |
mkdir -p ~/models/Orpheus
wget -O ~/models/Orpheus/orpheus-3b-0.1-ft-UD-Q4_K_XL.gguf \
https://huggingface.co/unsloth/orpheus-3b-0.1-ft-GGUF/resolve/main/orpheus-3b-0.1-ft-UD-Q4_K_XL.ggufwget -O ~/Orpheus/snac24_dynamic_fp16.onnx \
https://huggingface.co/onnx-community/snac_24khz-ONNX/resolve/main/onnx/decoder_model_fp16.onnxwget https://github.com/microsoft/onnxruntime/releases/download/v1.24.4/onnxruntime-linux-x64-gpu_cuda13-1.24.4.tgz
tar xzf onnxruntime-linux-x64-gpu_cuda13-1.24.4.tgzSplit GGUF shards — download all parts to the same directory.
| Variant | Size | Shards | RAM |
|---|---|---|---|
| UD-Q6_K_XL (recommended) | ~105 GB | 4 | 128+ GB |
| UD-Q4_K_XL (faster on DDR4) | ~58 GB | 2 | 96+ GB |
mkdir -p ~/models/QWEN-3.5-122B
for i in $(seq -w 1 4); do
wget -O ~/models/QWEN-3.5-122B/Qwen3.5-122B-A10B-UD-Q6_K_XL-0000${i}-of-00004.gguf \
"https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF/resolve/main/Qwen3.5-122B-A10B-UD-Q6_K_XL-0000${i}-of-00004.gguf"
done| Variant | Size | Shards | RAM |
|---|---|---|---|
| UD-Q3_K_XL | ~165 GB | 6 | 192 GB |
mkdir -p ~/models/QWEN-3.5-397B
for i in $(seq -w 1 6); do
wget -O ~/models/QWEN-3.5-397B/Qwen3.5-397B-A17B-UD-Q3_K_XL-0000${i}-of-00006.gguf \
"https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF/resolve/main/Qwen3.5-397B-A17B-UD-Q3_K_XL-0000${i}-of-00006.gguf"
donemkdir -p ~/models/Whisper
wget -O ~/models/Whisper/ggml-small.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bingit clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 \
-DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)Change -DCMAKE_CUDA_ARCHITECTURES=86 to 120 for Blackwell GPUs.
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cp ~/Orpheus/talk-llama.cpp examples/talk-llama/talk-llama.cpp
cmake -B build -DWHISPER_SDL2=ON -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 \
-DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)cd ~/Orpheus
cmake -B build \
-DONNXRUNTIME_ROOT=/home/user/onnxruntime-linux-x64-gpu_cuda13-1.24.4 \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
cp speak-daemon.sh ~/models/Whisper/speak-daemon.sh && chmod +x ~/models/Whisper/speak-daemon.sh# Basic
./launch-athena.sh
# With timestamped logging
./launch-athena.sh 2>&1 | ts '[%Y-%m-%d %H:%M:%S]' | tee athena.logKernel tuning — create /etc/sysctl.d/99-athena.conf:
vm.swappiness = 0
vm.max_map_count = 262144
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2Memory locking — create /etc/security/limits.d/athena.conf:
user soft memlock unlimited
user hard memlock unlimited
Disable swap:
sudo swapoff -a
sudo sed -i '/\sswap\s/s/^/#/' /etc/fstab| Spec | |
|---|---|
| CPU | Intel i7-11800H (8C/16T) |
| RAM | 128 GB DDR4-3200 |
| GPU | NVIDIA RTX 3080 Laptop 16 GB (384 GB/s) |
| LLM | Qwen3.5-122B-A10B UD-Q6_K_XL |
| Context | 81,920 tokens |
| Threads | 15 (1 reserved for OS/audio pipeline) |
| Spec | |
|---|---|
| CPU | Intel Core Ultra 9 285HX (8P+16E / 32T) |
| RAM | 192 GB DDR5-4000 |
| GPU | NVIDIA RTX PRO 5000 Blackwell 24 GB GDDR7 (896 GB/s) |
| LLM | Qwen3.5-397B-A17B UD-Q3_K_XL |
| Context | 131,072 tokens |
| Threads | 28 (4 reserved for OS/audio pipeline) |
| Component | VRAM |
|---|---|
| 122B-A10B non-expert weights (Q6_K) | ~6.4 GB |
| KV cache bf16 (12 attn layers, 82K ctx) | ~1.9 GB |
| GatedDeltaNet state (36 linear-attn layers) | ~0.15 GB |
| Compute buffers | ~1.0 GB |
| Orpheus UD-Q4_K_XL + KV (q8_0, 9216, 2 slots) | ~2.75 GB |
| Whisper small | ~0.65 GB |
| SNAC decoder (GPU) | ~0.1 GB |
| CUDA overhead | ~0.5 GB |
| Total | ~13.5 GB |
| Free (16 GB GPU) | ~2.5 GB |
Long responses (>300 characters) are split into character-limited chunks and processed through a pipelined architecture:
-
Character-based chunking — groups sentences until the next would exceed 300 characters. Keeps each request within per-slot context budget.
-
All-ahead parallel submission — all chunks submitted as concurrent HTTP requests with 200ms stagger. llama-server processes 2 simultaneously.
-
Retry with backoff — HTTP 503 triggers up to 3 retries (500ms/1000ms).
-
Pipelined playback — each chunk plays via background thread with double-buffered WAV files while later chunks generate.
-
PulseAudio drain guard — prevents buffer flush on long audio chunks.
per_slot = total_context / n_slots
per_slot ≥ (CHUNK_CHAR_LIMIT × 6.6) + 50 prompt + 100 margin
| Metric | Sequential (-np 1) | Parallel (-np 2) |
|---|---|---|
| Single-shot latency | 5.6s | 3.4s |
| Wall-time efficiency | 53% | 80% |
| First-audio latency (long) | ~67s | ~34s |
| Voice | Description |
|---|---|
| tara | Female, conversational (best rated) |
| leah, jess, mia, zoe | Female |
| leo, dan, zac | Male |
<laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>
Orpheus outputs 7 tokens per audio frame. Each frame encodes 3 RVQ codebook
layers in interleaved order: [L0, L1a, L2a, L2b, L1b, L2c, L2d].
Each codebook has 4096 entries. Audio tokens start at <custom_token_10>.
24,000 Hz mono. Each SNAC frame = 512 samples ≈ 21.3ms. ~47 frames/second. 7 tokens/frame = ~329 tokens per second of audio.
| Problem | Solution |
|---|---|
| "no audio tokens in response" | Verify llama-server is running and Orpheus model loaded |
| CUDA EP unavailable | Install cuDNN 9, check /etc/ld.so.conf, run sudo ldconfig |
| Audio truncation | Ensure per-slot context ≥ 2048 tokens. Check CHUNK_CHAR_LIMIT |
| SNAC decoder init failed | Delete .optimized cache, retry. Add --snac-cpu if persistent |
| OOM | Reduce --ctx-size, drop Orpheus to Q4_K_M, check VRAM budget |
| Echo | Increase 200ms sleep in talk-llama.cpp |
| speak-daemon.sh hangs | Check orpheus-speak is running and file paths match |
| File | Purpose |
|---|---|
orpheus-speak.cpp |
TTS pipeline: HTTP client, parallel submission, SNAC decoder, chunked playback with drain guard |
CMakeLists.txt |
Build configuration for orpheus-speak |
talk-llama.cpp |
Custom whisper.cpp talk-llama: Athena prompt, MoE offload, VAD, goodbye detection, date awareness |
speak-daemon.sh |
talk-llama → orpheus-speak bridge (file-based IPC) |
launch-athena.sh |
Launcher for T15g Gen 2 with perf tuning |
launch-athena (for P16 Gen 3).sh |
Launcher for P16 Gen 3 with 397B support |
valperf.sh |
Runtime performance settings validator |
README.md |
This file |
MIT licensed. Orpheus weights: Llama 3.2 community license. SNAC codec: MIT. Qwen3.5: Apache-2.0.