Paracap is a paralinguistic captioning toolkit that produces transcripts enriched with inline paralinguistic tags and caption files aligned to audio. The package exposes both a Python API and a paracap command line interface.
- Whisper/WhisperX automatic speech recognition integration
- Optional pyannote diarization with graceful fallback
- AudioSet-powered paralinguistic event tagging
- Alignment of events to word timestamps and inline tag rendering
- Writers for plain text, inline-tag text, VTT, SRT, and JSON summaries
- Extensible configuration powered by Pydantic models
pip install paracap
For GPU support with CUDA 12.8, install torch with the appropriate index:
pip install torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
Optional dependencies can be installed via extras:
pip install paracap[asr]
pip install paracap[diarization]
pip install paracap[tagging]
During development, install the package in editable mode and install dev dependencies:
pip install -e .[all]
pip install -r requirements.txt
cuDNN version conflicts: The CLI automatically preloads the nvidia-cudnn libraries to resolve version conflicts between ctranslate2 (requires cuDNN 9.1.x) and torch 2.8+ (ships with cuDNN 9.10.x). If you encounter cuDNN loading errors when using the Python API directly, either:
- Import paracap.cli first to trigger the preload, or
- Set LD_LIBRARY_PATH to include the nvidia cudnn lib path:
export LD_LIBRARY_PATH=$(python -c "import nvidia.cudnn; print(nvidia.cudnn.__path__[0])")/lib:$LD_LIBRARY_PATH
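When a script that uses the Python API is launched from another Python process, the same LD_LIBRARY_PATH value can be computed without a shell. The following is a minimal sketch and not part of paracap: the helper names cudnn_lib_dir, prepend_library_path, and child_env are hypothetical. Because the dynamic loader reads LD_LIBRARY_PATH when a process starts, the value is passed to a child process here rather than set inside the already running interpreter.

```python
import importlib.util
import os
from typing import Dict, Optional


def cudnn_lib_dir(package: str = "nvidia.cudnn") -> Optional[str]:
    """Return <package dir>/lib, or None if the package is not installed."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        # Raised when the parent package (e.g. "nvidia") is missing.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.join(list(spec.submodule_search_locations)[0], "lib")


def prepend_library_path(lib_dir: str, current: str) -> str:
    """Put lib_dir first in a path list, dropping empty and duplicate entries."""
    rest = [p for p in current.split(os.pathsep) if p and p != lib_dir]
    return os.pathsep.join([lib_dir] + rest)


def child_env() -> Dict[str, str]:
    """Environment for a child process, with the cuDNN lib dir first if found."""
    env = dict(os.environ)
    lib_dir = cudnn_lib_dir()
    if lib_dir is not None:
        env["LD_LIBRARY_PATH"] = prepend_library_path(
            lib_dir, env.get("LD_LIBRARY_PATH", "")
        )
    return env
```

The result of child_env() can be passed as the env argument of subprocess.run when starting the script that imports paracap.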
paracap transcribe path/to/audio.wav --whisper-size small --formats json,vtt,tags.txt
Python API example:
from paracap.pipeline import TranscriptionPipeline
from paracap.config import PipelineConfig
config = PipelineConfig()
TranscriptionPipeline.from_config(config).process_file("audio.wav")
For filtering large audio datasets without full transcription:
from paracap import EventDetector
detector = EventDetector()
# Detect events in 24kHz audio array
result = detector.detect(audio_24k, sample_rate=24000)
# Or detect directly from file
result = detector.detect_file("audio.wav")
# Check for specific events
if result.has_events("Laughter"):
    print(f"Found {len(result.get_events('Laughter'))} laughter events")
# Access debug info
print(f"Max laughter probability: {result.max_probability('Laughter'):.3f}")
print(f"Raw probabilities shape: {result.probabilities['Laughter'].shape}")
Outputs are written next to the input by default:
- audio.txt
- audio.tags.txt
- audio.vtt
- audio.srt
- audio.json
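Returning to the dataset-filtering use case of EventDetector shown earlier, the per-file check can be wrapped in a small generator. This is a sketch under stated assumptions: files_with_event is a hypothetical helper, and the two Protocol classes only describe the detect_file and has_events calls that appear in the example above, so any object with those methods (including a test double) can be passed in.

```python
from typing import Iterable, Iterator, Protocol


class DetectionResult(Protocol):
    """Only the part of the result object that this sketch relies on."""

    def has_events(self, label: str) -> bool: ...


class FileDetector(Protocol):
    """Only the part of the detector that this sketch relies on."""

    def detect_file(self, path: str) -> DetectionResult: ...


def files_with_event(
    paths: Iterable[str], detector: FileDetector, label: str = "Laughter"
) -> Iterator[str]:
    """Yield each path whose detection result contains the given label."""
    for path in paths:
        result = detector.detect_file(path)
        if result.has_events(label):
            yield path


# Intended use (requires paracap to be installed):
#   from paracap import EventDetector
#   keep = list(files_with_event(all_paths, EventDetector(), "Laughter"))
```

A generator keeps memory use flat on large datasets, since paths are yielded one at a time instead of being collected up front.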
Configuration is defined with Pydantic models and can be overridden via CLI options or YAML configuration files. Event tagging always exposes the complete label set provided by the underlying PANNs model, so no manual tag selection or mapping is necessary.
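The layering of defaults, YAML values, and CLI options can be illustrated roughly as follows. This is only a sketch of the idea, not paracap's configuration code: it uses a standard-library dataclass as a stand-in for the Pydantic models, the field names and defaults are hypothetical (loosely modeled on the CLI flags), and the precedence (CLI over YAML over defaults) is an assumption. The YAML file is assumed to be already parsed into a dict.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical fields and defaults, for illustration only.
    whisper_size: str = "small"
    language: Optional[str] = None
    device: str = "cpu"


def layered_config(
    yaml_values: Mapping[str, Any], cli_values: Mapping[str, Any]
) -> SketchConfig:
    """Apply YAML values over defaults, then CLI values over both."""
    config = replace(SketchConfig(), **yaml_values)
    # A CLI option left unset (None) must not clobber a YAML value.
    cli_set = {key: value for key, value in cli_values.items() if value is not None}
    return replace(config, **cli_set)


config = layered_config(
    {"whisper_size": "medium", "device": "cuda"},  # parsed YAML (assumed)
    {"whisper_size": "large", "device": None},     # parsed CLI options (assumed)
)
```

Unknown keys raise a TypeError in this sketch, which loosely mirrors the kind of validation a schema-based configuration provides.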
Key CLI flags:
- --out-dir
- --whisper-size
- --language
- --tag-thresholds (keys must match PANNs labels such as Laughter or Speech)
- --hysteresis
- --min-dur-seconds
- --merge-gap
- --snap-tolerance
- --formats
- --num-workers
- --device
- --hf-token-env
- --save-intermediate
- --verbose
paracap debug-events AUDIO runs only the event tagger and logs detected events to stdout (no files written). Combine with --tag-thresholds Laughter=0.4 to tune sensitivity interactively.
See paracap --help for full documentation.
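To give an intuition for how a per-label threshold can interact with the --hysteresis, --min-dur-seconds, and --merge-gap flags, here is a sketch of one common post-processing scheme for per-frame probabilities. The exact semantics of these flags in paracap are not documented here, so everything below is an assumption: segment_events is a hypothetical function, hysteresis is modeled as a lower exit level (threshold minus hysteresis), and frame_seconds is the assumed spacing between probability frames.

```python
from typing import List, Sequence, Tuple


def segment_events(
    probs: Sequence[float],
    frame_seconds: float,
    threshold: float,
    hysteresis: float = 0.0,
    min_dur_seconds: float = 0.0,
    merge_gap: float = 0.0,
) -> List[Tuple[float, float]]:
    """Turn per-frame probabilities into (start, end) segments in seconds."""
    exit_level = threshold - hysteresis  # an open event ends only below this
    segments: List[Tuple[float, float]] = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= threshold:
            start = i  # event opens at the first frame reaching the threshold
        elif start is not None and p < exit_level:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:  # event still open at the end of the audio
        segments.append((start * frame_seconds, len(probs) * frame_seconds))

    # Merge segments separated by a gap of at most merge_gap seconds.
    merged: List[Tuple[float, float]] = []
    for seg_start, seg_end in segments:
        if merged and seg_start - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], seg_end)
        else:
            merged.append((seg_start, seg_end))

    # Drop segments shorter than the minimum duration.
    return [(s, e) for s, e in merged if e - s >= min_dur_seconds]
```

In this scheme, lowering the threshold makes events open more easily, a larger hysteresis keeps an open event alive through brief dips, and merge_gap and min_dur_seconds trade fragmented detections against very short ones.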
Run linting and tests:
pytest -q
graph LR
A[Audio Input] -->|ffmpeg/soundfile| B[Pre-processing]
B --> C["ASR (WhisperX)"]
B --> D["Tagging (PANNs)"]
B --> E["Diarization (pyannote)"]
C --> F[Word timestamps]
D --> G[Events]
E --> H[Speaker Segments]
F --> I[Alignment]
G --> I
H --> I
I --> J[Renderers]
J --> K[txt]
J --> L[tags.txt]
J --> M[VTT/SRT]
J --> N[JSON]
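The Alignment and Renderers stages in the diagram above can be pictured with a small example that places event tags into a word sequence. This is a sketch, not paracap's implementation: the Word and Event classes and render_inline are hypothetical, and the use of snap_tolerance (an event is placed before the first word that starts no earlier than the event start minus the tolerance) is an assumed reading of the --snap-tolerance flag.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Event:
    label: str
    start: float  # seconds
    end: float    # seconds


def render_inline(
    words: Sequence[Word], events: Sequence[Event], snap_tolerance: float = 0.0
) -> str:
    """Render words with [Label] tags inserted at aligned word boundaries."""
    inserts: Dict[int, List[str]] = {}
    for event in sorted(events, key=lambda e: e.start):
        index = len(words)  # default: after the last word
        for i, word in enumerate(words):
            if word.start >= event.start - snap_tolerance:
                index = i
                break
        inserts.setdefault(index, []).append(f"[{event.label}]")

    tokens: List[str] = []
    for i, word in enumerate(words):
        tokens.extend(inserts.get(i, []))
        tokens.append(word.text)
    tokens.extend(inserts.get(len(words), []))
    return " ".join(tokens)


words = [Word("so", 0.0, 0.3), Word("funny", 0.4, 0.9), Word("right", 2.0, 2.4)]
text = render_inline(words, [Event("Laughter", 1.0, 1.8)])
# text == "so funny [Laughter] right"
```

A nonzero tolerance lets an event that starts slightly after a word onset still be placed before that word, which avoids tags landing one word late because of small timestamp jitter.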
The examples/ directory contains sample audio files and expected outputs to verify the pipeline is working correctly.
| File | Description | Source | Duration |
|---|---|---|---|
| examples/audio/laugh_clip.mp3 | Laughter starting at time 0 | Extracted | 3s |
| examples/audio/applause.wav | Crowd applause | YouTube | 5s |
| examples/audio/laughter_audioset.wav | Baby laughter | AudioSet | 10s |
| examples/audio/cough_audioset.wav | Cough/throat clearing | AudioSet | 10s |
Detect events only (no transcription):
# Detect laughter (demonstrates edge detection - starts at 00:00:00.000)
paracap debug-events examples/audio/laugh_clip.mp3
# Expected output:
# Detected 1 events in examples/audio/laugh_clip.mp3:
# Laughter 00:00:00.000–00:00:01.595 score=0.480
# Detect applause
paracap debug-events examples/audio/applause.wav
# Expected output:
# Detected 1 events in examples/audio/applause.wav:
# Applause 00:00:01.277–00:00:05.000 score=0.571
# Detect cough from AudioSet (demonstrates edge detection - starts at 00:00:00.000)
paracap debug-events examples/audio/cough_audioset.wav
# Expected output:
# Detected 1 events in examples/audio/cough_audioset.wav:
# Cough 00:00:00.000–00:00:03.836 score=0.458
# Detect multiple laughter events from AudioSet
paracap debug-events examples/audio/laughter_audioset.wav
# Expected output:
# Detected 2 events in examples/audio/laughter_audioset.wav:
# Laughter 00:00:00.320–00:00:02.238 score=0.652
# Laughter 00:00:03.516–00:00:05.115 score=0.763
Full transcription pipeline:
# Run full pipeline with ASR + event tagging
paracap transcribe examples/audio/laugh_clip.mp3 --no-diarize --formats json,tags.txt
# Output files:
# - laugh_clip.json (full transcript with events)
# - laugh_clip.tags.txt (inline transcript with [Laughter] tags)
Reference outputs are in examples/expected_output/ for validation.
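When checking a local run against the expected debug-events output listed earlier, it can help to parse the event lines into numbers instead of comparing strings. The sketch below assumes the line format is exactly as shown in those examples (label, start and end timestamps joined by a dash, then score=...); DetectedEvent, to_seconds, and parse_event_line are hypothetical names and not part of paracap.

```python
import re
from typing import NamedTuple, Optional


class DetectedEvent(NamedTuple):
    label: str
    start: float  # seconds
    end: float    # seconds
    score: float


# Accepts either an en dash or a plain hyphen between the two timestamps.
_EVENT_LINE = re.compile(
    r"^\s*(?P<label>\S.*?)\s+"
    r"(?P<start>\d{2}:\d{2}:\d{2}\.\d{3})[–-]"
    r"(?P<end>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"score=(?P<score>\d+(?:\.\d+)?)\s*$"
)


def to_seconds(timestamp: str) -> float:
    """Convert HH:MM:SS.mmm to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_event_line(line: str) -> Optional[DetectedEvent]:
    """Parse one event line; return None for headers and other lines."""
    match = _EVENT_LINE.match(line)
    if match is None:
        return None
    return DetectedEvent(
        label=match["label"],
        start=to_seconds(match["start"]),
        end=to_seconds(match["end"]),
        score=float(match["score"]),
    )


event = parse_event_line("Laughter 00:00:00.320–00:00:02.238 score=0.652")
```

With parsed values, a comparison can allow a small tolerance on times and scores, which is more robust than exact text equality across hardware and library versions.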
Evaluate detection performance on the AudioSet eval subset:
# Evaluate on 100 samples
python scripts/evaluate_audioset.py --num-samples 100
# Evaluate on 300 samples and save results
python scripts/evaluate_audioset.py --num-samples 300 --output results.json
# Evaluate specific labels with failure analysis
python scripts/evaluate_audioset.py --num-samples 200 --labels Laughter,Cough --analyze-failures
The script downloads samples from AudioSet that match the configured event labels and computes precision, recall, and F1 scores.
| Label | Precision | Recall | F1 |
|---|---|---|---|
| Applause | 1.000 | 0.529 | 0.692 |
| Cough | 1.000 | 0.667 | 0.800 |
| Laughter | 0.965 | 0.573 | 0.719 |
| Sigh | 0.944 | 0.370 | 0.531 |
| Overall | 0.977 | 0.540 | 0.696 |
Thresholds were tuned to maximize F1. Gasp is excluded because PANNs detects it unreliably (F1=0.12).
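For reference, the F1 column in the table above is the harmonic mean of precision and recall. The sketch below shows the standard formulas; it is not taken from scripts/evaluate_audioset.py, and the function names are hypothetical.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Applause row from the table: precision 1.000, recall 0.529
applause_f1 = round(f1_score(1.000, 0.529), 3)  # 0.692
```

Recomputing F1 from the rounded precision and recall values can differ from the table in the last digit, since the table was presumably computed from unrounded counts.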
MIT