Production-Ready Streaming Voice Activity Detection (VAD) with ONNX Runtime
- ✅ Package as Python wheel (`pip install fireredvad-engineering`)
- ✅ Add command-line interface (CLI) tool
- ✅ Include pre-built ONNX models and CMVN parameters
- ✅ Verified ONNX vs PyTorch consistency (max diff ≈ 1.19×10⁻⁷)
- ✅ All 230 test frames validated
- ✅ Project is fully self-contained (includes fireredvad.core)
FireRedVAD-Engineering is a production-ready, streaming Voice Activity Detection (VAD) system optimized for real-time applications. It is built on ONNX Runtime for performance and cross-platform compatibility.
Based on the FireRedVAD model by FireRed Team (Xiaohongshu).
- ⚡ Real-time Streaming - Low-latency audio stream processing (RTF < 0.1)
- 🎯 High Precision - ONNX vs PyTorch max difference ≈ 1.19×10⁻⁷ (validated)
- 📦 Lightweight - Only ONNX Runtime + Kaldi Fbank required (~2.3 MB model)
- 🔧 Easy Integration - Clean API and CLI for quick integration
- 💻 Cross-Platform - Windows, Linux, macOS support
- 🎤 Multiple Tasks - VAD (Voice Activity Detection) + AED (Audio Event Detection)
pip install fireredvad-engineering

# Clone the repository
git clone https://github.com/leospark/FireRedVAD-Engineering.git
cd FireRedVAD-Engineering
# Install dependencies
pip install -r requirements.txt

from fireredvad.inference.streaming import StreamVAD, StreamVadConfig
import soundfile as sf
# Initialize VAD
config = StreamVadConfig(
    onnx_path="models/model_with_caches.onnx",
    cmvn_path="models/cmvn.ark",  # Required!
    speech_threshold=0.5,
)
vad = StreamVAD(config)
# Process audio file
audio, sr = sf.read("audio.wav", dtype='int16')
segments = vad.process_audio(audio, sample_rate=sr)
# Output results
print(f"Detected {len(segments)} speech segments:")
for start, end, prob in segments:
    print(f"  {start:.2f}s - {end:.2f}s (confidence: {prob:.2f})")

# Basic usage
fireredvad audio.wav
# Save results to JSON
fireredvad audio.wav --output segments.json
# Generate probability plot
fireredvad audio.wav --plot vad_plot.png
# Adjust sensitivity (lower threshold = more sensitive)
fireredvad audio.wav --threshold 0.3
# Verbose output
fireredvad audio.wav --verbose

FireRedVAD-Engineering/
├── models/                                  # Pre-trained models
│   ├── model_with_caches.onnx (96KB)        # ONNX model structure
│   ├── model_with_caches.onnx.data (2.2MB)  # Model weights
│   └── cmvn.ark (1.3KB)                     # ⚠️ Required! CMVN parameters
├── fireredvad/                              # Python package
│   ├── core/
│   │   ├── audio_feat.py                    # Kaldi Fbank + CMVN feature extraction
│   │   └── detect_model.py                  # PyTorch model (for validation)
│   ├── cli.py                               # Command-line interface
│   └── __init__.py
├── inference/                               # Inference engine
│   └── streaming.py                         # Streaming VAD engine
├── examples/                                # Usage examples
│   ├── demo.py                              # Basic detection demo
│   └── plot_vad_prob.py                     # Visualization tool
├── tests/                                   # Test scripts
│   ├── verify_onnx_vs_pytorch.py            # ONNX vs PyTorch validation
│   └── plot_comparison.py                   # Generate comparison plots
├── setup.py                                 # Package setup
├── pyproject.toml                           # Modern Python package config
├── requirements.txt                         # Dependencies
└── README.md                                # This file
| Parameter | Value | Description |
|---|---|---|
| Sample Rate | 16 kHz | Required input sample rate |
| Frame Length | 400 samples | 25 ms |
| Frame Shift | 160 samples | 10 ms |
| Feature Dim | 80 | Log-Mel Fbank + CMVN |
| Model Size | 2.3 MB | ONNX format |
| Memory Usage | < 50 MB | Runtime memory |
| RTF (CPU) | < 0.1 | Real-time factor |
| Cache States | 8 × [1, 128, 19] | Streaming cache tensors |
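
The frame parameters in the table determine how many frames a clip produces. The sketch below works this out under the assumption that only complete 25 ms windows are counted, with no padding at the edges; the table does not state how edges are handled, so treat that as a guess. `num_frames` and `real_time_factor` are illustrative helpers, not functions from this package.

```python
SAMPLE_RATE = 16000   # Hz, from the table
FRAME_LENGTH = 400    # samples (25 ms)
FRAME_SHIFT = 160     # samples (10 ms)


def num_frames(num_samples: int) -> int:
    """Count complete frames, assuming no padding at the clip edges."""
    if num_samples < FRAME_LENGTH:
        return 0
    return 1 + (num_samples - FRAME_LENGTH) // FRAME_SHIFT


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


samples = round(2.32 * SAMPLE_RATE)   # a 2.32 s clip at 16 kHz
print(num_frames(samples))
print(real_time_factor(0.2, 2.32))    # 0.2 s is a made-up processing time
```

Under this assumption a 2.32 s clip (37,120 samples) yields 230 frames, which agrees with the frame count reported for the validation audio.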
Test Audio: Real Chinese speech (2.32s, 230 frames)
| Metric | Value |
|---|---|
| Total Frames Tested | 230 |
| Max Difference | 1.19×10⁻⁷ |
| Mean Difference | 3.29×10⁻⁸ |
| Median Difference | 2.61×10⁻⁸ |
| < 1e-7 | 93.9% of frames ✅ |
| < 1e-5 | 100% of frames ✅ |
Conclusion: ✅ All 230 frames differ by less than 1e-5 (the required threshold).
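
Statistics like those in the table can be computed from two per-frame probability sequences. The following is a minimal sketch of such a comparison, not the contents of `tests/verify_onnx_vs_pytorch.py`; `diff_report` is a hypothetical helper and the input values are toy numbers rather than the project's measurements.

```python
from statistics import mean, median


def diff_report(onnx_probs, torch_probs, thresholds=(1e-7, 1e-5)):
    """Summarize per-frame absolute differences between two probability sequences."""
    if len(onnx_probs) != len(torch_probs):
        raise ValueError("both sequences must have one value per frame")
    diffs = [abs(a - b) for a, b in zip(onnx_probs, torch_probs)]
    return {
        "frames": len(diffs),
        "max": max(diffs),
        "mean": mean(diffs),
        "median": median(diffs),
        # Fraction of frames whose difference is strictly below each threshold.
        "frac_below": {t: sum(d < t for d in diffs) / len(diffs) for t in thresholds},
    }


# Toy values only; these are not the project's measurements.
onnx_out = [0.10, 0.50, 0.90, 0.99]
torch_out = [0.10, 0.50, 0.90 + 2e-6, 0.99]
report = diff_report(onnx_out, torch_out)
print(report)
```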
| Test Audio | Duration | Detected Segments | Time Range | Confidence |
|---|---|---|---|---|
| test_audio.wav | 2.32s | 1 segment | 0.49s - 1.80s | 0.99 |
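
A segment like the one in the table is typically obtained by thresholding per-frame probabilities and then applying minimum-duration rules. The sketch below illustrates that idea using the documented defaults (`speech_threshold=0.5`, `min_speech_frame=8`, `min_silence_frame=20`, 10 ms frame shift). It is a simplified stand-in, not the `StreamVAD` implementation: it omits the smoothing window and start padding, and `probs_to_segments` is a hypothetical name.

```python
def probs_to_segments(probs, threshold=0.5, min_speech=8, min_silence=20,
                      frame_shift_s=0.01):
    """Turn per-frame speech probabilities into (start_s, end_s, mean_prob) tuples.

    Simplified illustration: no smoothing window and no start padding.
    """
    # 1. Collect runs of consecutive frames at or above the threshold.
    runs = []          # each run is [first_frame, last_frame_exclusive]
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(probs)])

    # 2. Merge runs separated by a silence gap shorter than min_silence frames.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_silence:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop runs shorter than min_speech frames; convert frames to seconds.
    segments = []
    for first, last in merged:
        if last - first >= min_speech:
            avg = sum(probs[first:last]) / (last - first)
            segments.append((first * frame_shift_s, last * frame_shift_s, avg))
    return segments


# Synthetic input: 49 silent frames, 131 speech frames, 50 silent frames.
probs = [0.01] * 49 + [0.99] * 131 + [0.01] * 50
print(probs_to_segments(probs))
```

With this synthetic input the function reports a single segment from roughly 0.49 s to 1.80 s. Lowering `threshold` admits quieter frames, while raising `min_speech` discards short bursts.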
config = StreamVadConfig(
    onnx_path="models/model_with_caches.onnx",  # ONNX model path (required)
    cmvn_path="models/cmvn.ark",                # CMVN parameters (required)
    sample_rate=16000,                          # Audio sample rate
    frame_shift_ms=10,                          # Frame shift in ms
    speech_threshold=0.5,                       # Detection threshold (0-1)
    smooth_window_size=5,                       # Smoothing window size
    min_speech_frame=8,                         # Min speech frames (80ms)
    min_silence_frame=20,                       # Min silence frames (200ms)
    pad_start_frame=5,                          # Padding frames at start
    use_gpu=False,                              # Use GPU acceleration
)

vad = StreamVAD(config)
# Process complete audio
segments = vad.process_audio(audio, sample_rate=16000)
# Returns: [(start, end, probability), ...]
# Process single frame (for streaming)
result = vad.process_frame(audio_frame)
# Returns: FrameResult object
# Process audio chunk (for streaming)
results = vad.process_chunk(audio_chunk)
# Returns: [FrameResult, ...]
# Reset all states (for new audio stream)
vad.reset()

fireredvad audio.wav [OPTIONS]
Options:
  -o, --output PATH       Output JSON file for speech segments
  -p, --plot PATH         Save VAD probability plot to file
  -t, --threshold FLOAT   Speech detection threshold (0.0-1.0, default: 0.5)
  --min-speech FLOAT      Minimum speech duration in seconds (default: 0.08)
  --min-silence FLOAT     Minimum silence duration in seconds (default: 0.2)
  --model PATH            Path to ONNX model
  --cmvn PATH             Path to CMVN file
  -v, --verbose           Verbose output
  --help                  Show help message

segments = vad.process_audio(audio, sample_rate=16000)
for start, end, _ in segments:
    speech = audio[int(start * 16000):int(end * 16000)]
    text = asr.recognize(speech)  # `asr` stands in for your own ASR engine

segments = vad.process_audio(meeting_audio, sample_rate=16000)
for start, end, prob in segments:
    print(f"Speaking: {start:.1f}s - {end:.1f}s (conf: {prob:.2f})")

total_duration = len(audio) / 16000
voice_duration = sum(end - start for start, end, _ in segments)
print(f"Voice Activity: {voice_duration/total_duration:.2%}")

import sounddevice as sd
vad = StreamVAD(config)
vad.reset()
def audio_callback(chunk, frames, time, status):
    results = vad.process_chunk(chunk.flatten())
    if any(r.is_speech for r in results):
        print("Speech detected!")

stream = sd.InputStream(callback=audio_callback, channels=1, samplerate=16000)
with stream:          # the callback only fires while the stream is active
    sd.sleep(10_000)  # listen for 10 seconds

onnxruntime>=1.10.0         # ONNX inference
numpy>=1.20.0 # Numerical operations
soundfile>=0.10.0 # Audio file I/O
kaldiio>=2.18.0 # Kaldi CMVN loading
kaldi-native-fbank>=1.7.0   # Kaldi Fbank feature extraction

torch>=1.10.0               # PyTorch (for validation)
matplotlib>=3.5.0 # Visualization
scipy>=1.7.0                # Audio processing

A: CMVN (Cepstral Mean and Variance Normalization) normalizes the Fbank features before they reach the model. Without it, the features are on the wrong scale and the model outputs incorrect probabilities.
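
To make the normalization concrete, here is a minimal sketch of textbook CMVN: each feature dimension is shifted by a mean and scaled by an inverse standard deviation. It assumes the per-dimension statistics are already available as plain lists; how `cmvn.ark` stores them and how `audio_feat.py` applies them is not shown here, and `apply_cmvn` is a hypothetical helper.

```python
import math


def apply_cmvn(features, means, variances, eps=1e-10):
    """Normalize each feature dimension to zero mean and unit variance.

    features: list of frames, each a list of per-dimension values.
    means, variances: per-dimension statistics (hypothetical inputs here).
    """
    # Clamp the variance so a constant dimension does not divide by zero.
    inv_std = [1.0 / math.sqrt(max(v, eps)) for v in variances]
    return [
        [(x - m) * s for x, m, s in zip(frame, means, inv_std)]
        for frame in features
    ]


# Toy 2-dimensional features; real frames would have 80 Fbank dimensions.
frames = [[1.0, 10.0], [3.0, 30.0]]
means = [2.0, 20.0]
variances = [1.0, 100.0]
normalized = apply_cmvn(frames, means, variances)
print(normalized)
```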
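from scipy.signal import resample

# Resample to the required 16 kHz (original_sr is the file's sample rate)
audio_16k = resample(audio, int(len(audio) * 16000 / original_sr))

# Down-mix multi-channel audio
if audio.ndim == 2:
    audio = audio.mean(axis=1)  # Convert to mono

A: Yes! MIT License allows commercial use.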
# More sensitive (detect quieter speech)
config = StreamVadConfig(speech_threshold=0.3)
# Stricter (fewer false positives)
config = StreamVadConfig(speech_threshold=0.7)

- `model_with_caches.onnx`: Optimized for streaming with explicit cache states (recommended)
- `model.onnx`: Original model without cache optimization
- ✅ Package as Python wheel
- ✅ Add CLI tool
- ✅ Include pre-built models
- ✅ Improve documentation
- Optimize for mobile deployment (TensorFlow Lite, CoreML)
MIT License - Free for personal and commercial use. See LICENSE for details.
- Based on FireRedVAD model
- Original authors: Kaituo Xu, Wenpeng Li, Kai Huang, Kun Liu (Xiaohongshu)
- Thanks to the open source community
Last Updated: 2026-03-18
Version: 1.1.0
Status: ✅ Production Ready
Enjoy FireRedVAD!