Skip to content

leospark/FireRedVAD-Engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

14 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

FireRedVAD-Engineering

License: MIT Python 3.8+ ONNX Validation Wheel

Production-Ready Streaming Voice Activity Detection (VAD) with ONNX Runtime


πŸ”₯ What's New

v1.1.0 (2026-03-18)

  • βœ… Package as Python wheel (pip install fireredvad-engineering)
  • βœ… Add command-line interface (CLI) tool
  • βœ… Include pre-built ONNX models and CMVN parameters
  • βœ… Verified ONNX vs PyTorch consistency (max diff < 1.19Γ—10⁻⁷)
  • βœ… All 230 test frames validated
  • βœ… Project is fully self-contained (includes fireredvad.core)

πŸ“– What is FireRedVAD?

FireRedVAD-Engineering is a production-ready, streaming Voice Activity Detection (VAD) system optimized for real-time applications. Built with ONNX runtime for maximum performance and cross-platform compatibility.

Based on the FireRedVAD model by FireRed Team (Xiaohongshu).

Key Features

  • ⚑ Real-time Streaming - Low-latency audio stream processing (RTF < 0.1)
  • 🎯 High Precision - ONNX vs PyTorch max difference < 1.19Γ—10⁻⁷ (validated)
  • πŸ“¦ Lightweight - Only ONNX Runtime + Kaldi Fbank required (~2.3MB)
  • πŸ”§ Easy Integration - Clean API and CLI for quick integration
  • πŸ’» Cross-Platform - Windows, Linux, macOS support
  • 🎀 Multiple Tasks - VAD (Voice Activity Detection) + AED (Audio Event Detection)

πŸš€ Quick Start

Installation

Option 1: Install from PyPI (Recommended)

pip install fireredvad-engineering

Option 2: Install from Source

# Clone the repository
git clone https://github.com/leospark/FireRedVAD-Engineering.git
cd FireRedVAD-Engineering

# Install dependencies
pip install -r requirements.txt

Usage

Option 1: Python API

from fireredvad.inference.streaming import StreamVAD, StreamVadConfig
import soundfile as sf

# Initialize VAD
config = StreamVadConfig(
    onnx_path="models/model_with_caches.onnx",
    cmvn_path="models/cmvn.ark",  # Required!
    speech_threshold=0.5
)
vad = StreamVAD(config)

# Process audio file
audio, sr = sf.read("audio.wav", dtype='int16')
segments = vad.process_audio(audio, sample_rate=sr)

# Output results
print(f"Detected {len(segments)} speech segments:")
for start, end, prob in segments:
    print(f"  {start:.2f}s - {end:.2f}s (confidence: {prob:.2f})")

Option 2: Command-Line Interface (CLI)

# Basic usage
fireredvad audio.wav

# Save results to JSON
fireredvad audio.wav --output segments.json

# Generate probability plot
fireredvad audio.wav --plot vad_plot.png

# Adjust sensitivity (lower threshold = more sensitive)
fireredvad audio.wav --threshold 0.3

# Verbose output
fireredvad audio.wav --verbose

πŸ“ Project Structure

FireRedVAD-Engineering/
β”œβ”€β”€ models/                          # Pre-trained models
β”‚   β”œβ”€β”€ model_with_caches.onnx       (96KB)   # ONNX model structure
β”‚   β”œβ”€β”€ model_with_caches.onnx.data  (2.2MB)  # Model weights
β”‚   └── cmvn.ark                     (1.3KB)  # ⚠️ Required! CMVN parameters
β”œβ”€β”€ fireredvad/                      # Python package
β”‚   β”œβ”€β”€ core/
β”‚   β”‚   β”œβ”€β”€ audio_feat.py            # Kaldi Fbank + CMVN feature extraction
β”‚   β”‚   └── detect_model.py          # PyTorch model (for validation)
β”‚   β”œβ”€β”€ cli.py                       # Command-line interface
β”‚   └── __init__.py
β”œβ”€β”€ inference/                       # Inference engine
β”‚   └── streaming.py                 # Streaming VAD engine
β”œβ”€β”€ examples/                        # Usage examples
β”‚   β”œβ”€β”€ demo.py                      # Basic detection demo
β”‚   └── plot_vad_prob.py             # Visualization tool
β”œβ”€β”€ tests/                           # Test scripts
β”‚   β”œβ”€β”€ verify_onnx_vs_pytorch.py    # ONNX vs PyTorch validation
β”‚   └── plot_comparison.py           # Generate comparison plots
β”œβ”€β”€ setup.py                         # Package setup
β”œβ”€β”€ pyproject.toml                   # Modern Python package config
β”œβ”€β”€ requirements.txt                 # Dependencies
└── README.md                        # This file

πŸ“Š Technical Specifications

Parameter Value Description
Sample Rate 16 kHz Required input sample rate
Frame Length 400 samples 25 ms
Frame Shift 160 samples 10 ms
Feature Dim 80 Log-Mel Fbank + CMVN
Model Size 2.3 MB ONNX format
Memory Usage < 50 MB Runtime memory
RTF (CPU) < 0.1 Real-time factor
Cache States 8 Γ— [1, 128, 19] Streaming cache tensors

πŸ”¬ Validation Results

ONNX vs PyTorch Consistency

Test Audio: Real Chinese speech (2.32s, 230 frames)

Metric Value
Total Frames Tested 230
Max Difference 1.19Γ—10⁻⁷
Mean Difference 3.29Γ—10⁻⁸
Median Difference 2.61Γ—10⁻⁸
< 1e-7 93.9% of frames βœ…
< 1e-5 100% of frames βœ…

Conclusion: βœ… All 230 frames have difference < 1e-5 (requirement threshold)

Speech Detection Test

Test Audio Duration Detected Segments Time Range Confidence
test_audio.wav 2.32s 1 segment 0.49s - 1.80s 0.99

πŸ”§ API Reference

StreamVadConfig

config = StreamVadConfig(
    onnx_path="models/model_with_caches.onnx",    # ONNX model path (required)
    cmvn_path="models/cmvn.ark",                   # CMVN parameters (required)
    sample_rate=16000,                             # Audio sample rate
    frame_shift_ms=10,                             # Frame shift in ms
    speech_threshold=0.5,                          # Detection threshold (0-1)
    smooth_window_size=5,                          # Smoothing window size
    min_speech_frame=8,                            # Min speech frames (80ms)
    min_silence_frame=20,                          # Min silence frames (200ms)
    pad_start_frame=5,                             # Padding frames at start
    use_gpu=False                                  # Use GPU acceleration
)

StreamVAD Class

vad = StreamVAD(config)

# Process complete audio
segments = vad.process_audio(audio, sample_rate=16000)
# Returns: [(start, end, probability), ...]

# Process single frame (for streaming)
result = vad.process_frame(audio_frame)
# Returns: FrameResult object

# Process audio chunk (for streaming)
results = vad.process_chunk(audio_chunk)
# Returns: [FrameResult, ...]

# Reset all states (for new audio stream)
vad.reset()

CLI Options

fireredvad audio.wav [OPTIONS]

Options:
  -o, --output PATH        Output JSON file for speech segments
  -p, --plot PATH          Save VAD probability plot to file
  -t, --threshold FLOAT    Speech detection threshold (0.0-1.0, default: 0.5)
  --min-speech FLOAT       Minimum speech duration in seconds (default: 0.08)
  --min-silence FLOAT      Minimum silence duration in seconds (default: 0.2)
  --model PATH             Path to ONNX model
  --cmvn PATH              Path to CMVN file
  -v, --verbose            Verbose output
  --help                   Show help message

πŸ’‘ Use Cases

1. ASR Preprocessing

segments = vad.process_audio(audio, sample_rate=16000)
for start, end, _ in segments:
    speech = audio[int(start*16000):int(end*16000)]
    text = asr.recognize(speech)

2. Meeting Analysis

segments = vad.process_audio(meeting_audio, sample_rate=16000)
for start, end, prob in segments:
    print(f"Speaking: {start:.1f}s - {end:.1f}s (conf: {prob:.2f})")

3. Voice Activity Ratio

total_duration = len(audio) / 16000
voice_duration = sum(end - start for start, end, _ in segments)
print(f"Voice Activity: {voice_duration/total_duration:.2%}")

4. Real-time Streaming

import sounddevice as sd

vad = StreamVAD(config)
vad.reset()

def audio_callback(chunk, frames, time, status):
    results = vad.process_chunk(chunk.flatten())
    if any(r.is_speech for r in results):
        print("Speech detected!")

stream = sd.InputStream(callback=audio_callback, channels=1, samplerate=16000)

βš™οΈ Dependencies

Core Requirements

onnxruntime>=1.10.0      # ONNX inference
numpy>=1.20.0            # Numerical operations
soundfile>=0.10.0        # Audio file I/O
kaldiio>=2.18.0          # Kaldi CMVN loading
kaldi-native-fbank>=1.7.0 # Kaldi Fbank feature extraction

Optional (for development)

torch>=1.10.0            # PyTorch (for validation)
matplotlib>=3.5.0        # Visualization
scipy>=1.7.0             # Audio processing

❓ FAQ

Q: Why is cmvn.ark required?

A: CMVN (Cepstral Mean and Variance Normalization) is critical for feature normalization. Without it, the model cannot properly process audio features and will output incorrect probabilities.

Q: How to process non-16kHz audio?

from scipy.signal import resample
audio_16k = resample(audio, int(len(audio) * 16000 / original_sr))

Q: How to handle stereo audio?

if audio.ndim == 2:
    audio = audio.mean(axis=1)  # Convert to mono

Q: Can I use this commercially?

A: Yes! MIT License allows commercial use.

Q: How to adjust sensitivity?

# More sensitive (detect quieter speech)
config = StreamVadConfig(speech_threshold=0.3)

# Stricter (fewer false positives)
config = StreamVadConfig(speech_threshold=0.7)

Q: What's the difference between model_with_caches.onnx and model.onnx?

  • model_with_caches.onnx: Optimized for streaming with explicit cache states (recommended)
  • model.onnx: Original model without cache optimization

πŸ—ΊοΈ Roadmap

v1.1.0 (Current)

  • βœ… Package as Python wheel
  • βœ… Add CLI tool
  • βœ… Include pre-built models
  • βœ… Improve documentation

Future Plans

  • Optimize for mobile deployment (TensorFlow Lite, CoreML)

πŸ“ License

MIT License - Free for personal and commercial use. See LICENSE for details.


πŸ™ Acknowledgments

  • Based on FireRedVAD model
  • Original authors: Kaituo Xu, Wenpeng Li, Kai Huang, Kun Liu (Xiaohongshu)
  • Thanks to the open source community

Last Updated: 2026-03-18
Version: 1.1.0
Status: βœ… Production Ready

Enjoy FireRedVAD! ⭐

About

Lightweight streaming Voice Activity Detection (VAD) tool with ONNX runtime

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages