literate-goggles/paracap


Paracap

Paracap is a paralinguistic captioning toolkit. It produces transcripts enriched with inline paralinguistic tags, plus caption files aligned to the audio. The package exposes both a Python API and a paracap command-line interface.

Features

  • Whisper/WhisperX automatic speech recognition integration
  • Optional pyannote diarization with graceful fallback
  • AudioSet-powered paralinguistic event tagging
  • Alignment of events to word timestamps and inline tag rendering
  • Writers for plain text, inline-tag text, VTT, SRT, and JSON summaries
  • Extensible configuration powered by Pydantic models

Installation

pip install paracap

For GPU support with CUDA 12.8, install torch with the appropriate index:

pip install torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

Optional dependencies can be installed via extras:

pip install "paracap[asr]"
pip install "paracap[diarization]"
pip install "paracap[tagging]"

During development, install the package in editable mode together with the dev dependencies:

pip install -e ".[all]"
pip install -r requirements.txt

Troubleshooting

cuDNN version conflicts: The CLI automatically preloads the nvidia-cudnn libraries to resolve version conflicts between ctranslate2 (requires cuDNN 9.1.x) and torch 2.8+ (ships with cuDNN 9.10.x). If you encounter cuDNN loading errors when using the Python API directly, either:

  1. Import paracap.cli first to trigger the preload, or
  2. Set LD_LIBRARY_PATH to include the nvidia cudnn lib path:
    export LD_LIBRARY_PATH=$(python -c "import nvidia.cudnn; print(nvidia.cudnn.__path__[0])")/lib:$LD_LIBRARY_PATH
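The second option can also be prepared from Python when a script launches paracap in a child process. The sketch below is only an illustration: build_env_with_cudnn and cudnn_lib_dir are hypothetical helpers, not part of paracap, and the code assumes the nvidia cuDNN wheel is importable as nvidia.cudnn, as in the shell command above.

```python
import os


def cudnn_lib_dir():
    """Return the lib directory of the nvidia cuDNN wheel, or None if it is absent."""
    try:
        import nvidia.cudnn  # same module the shell one-liner above imports
    except ImportError:
        return None
    return os.path.join(list(nvidia.cudnn.__path__)[0], "lib")


def build_env_with_cudnn(lib_dir, base_env=None):
    """Hypothetical helper: copy an environment and prepend lib_dir to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    if lib_dir is None:
        return env  # nothing to add, leave the environment unchanged
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + current if current else "")
    return env


# The dynamic loader reads LD_LIBRARY_PATH when a process starts, so pass the
# result to a child process (for example subprocess.run([...], env=env))
# instead of mutating os.environ after torch or ctranslate2 is imported.
env = build_env_with_cudnn(cudnn_lib_dir())
```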

Quickstart

paracap transcribe path/to/audio.wav --whisper-size small --formats json,vtt,tags.txt

Python API example:

from paracap.pipeline import TranscriptionPipeline
from paracap.config import PipelineConfig

config = PipelineConfig()
TranscriptionPipeline.from_config(config).process_file("audio.wav")

Lightweight Event Detection

For filtering large audio datasets without full transcription:

from paracap import EventDetector

detector = EventDetector()

# Detect events in an in-memory 24 kHz audio array
# (audio_24k is assumed to be a waveform you have already loaded)
result = detector.detect(audio_24k, sample_rate=24000)

# Or detect directly from file
result = detector.detect_file("audio.wav")

# Check for specific events
if result.has_events("Laughter"):
    print(f"Found {len(result.get_events('Laughter'))} laughter events")

# Access debug info
print(f"Max laughter probability: {result.max_probability('Laughter'):.3f}")
print(f"Raw probabilities shape: {result.probabilities['Laughter'].shape}")
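For the dataset-filtering use case, the calls above can be wrapped in a small generator. This is a sketch rather than part of the paracap API: files_with_event is a hypothetical helper, and it relies only on the detect_file and has_events methods shown above.

```python
from pathlib import Path
from typing import Iterable, Iterator


def files_with_event(detector, paths: Iterable[str], label: str = "Laughter") -> Iterator[Path]:
    """Yield the paths whose detection result contains at least one `label` event.

    `detector` is any object with a detect_file(path) method whose result
    offers has_events(label), such as the EventDetector shown above.
    """
    for path in map(Path, paths):
        result = detector.detect_file(str(path))
        if result.has_events(label):
            yield path
```

With paracap installed, a call such as list(files_with_event(EventDetector(), ["a.wav", "b.wav"])) would keep only the files that contain laughter.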

Transcription outputs (from paracap transcribe or the pipeline API) are written next to the input file by default, one file per requested format:

  • audio.txt
  • audio.tags.txt
  • audio.vtt
  • audio.srt
  • audio.json
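The two caption formats differ mainly in timestamp punctuation: WebVTT separates seconds and milliseconds with a period, SubRip (SRT) with a comma. The following sketch shows one way to format cue timings for both; format_timestamp and cue_timing are hypothetical names, not paracap's writers.

```python
def format_timestamp(seconds: float, decimal_mark: str = ".") -> str:
    """Format seconds as HH:MM:SS<mark>mmm (WebVTT uses '.', SubRip uses ',')."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{ms:03d}"


def cue_timing(start: float, end: float, fmt: str = "vtt") -> str:
    """Build the 'start --> end' timing line of a caption cue."""
    mark = "," if fmt == "srt" else "."
    return f"{format_timestamp(start, mark)} --> {format_timestamp(end, mark)}"
```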

Configuration

Configuration is defined with Pydantic models and can be overridden via CLI options or YAML configuration files. Event tagging always exposes the complete label set provided by the underlying PANNs model, so no manual tag selection or mapping is necessary.

Key CLI flags:

  • --out-dir
  • --whisper-size
  • --language
  • --tag-thresholds (keys must match PANNs labels such as Laughter or Speech)
  • --hysteresis
  • --min-dur-seconds
  • --merge-gap
  • --snap-tolerance
  • --formats
  • --num-workers
  • --device
  • --hf-token-env
  • --save-intermediate
  • --verbose

The paracap debug-events AUDIO subcommand runs only the event tagger and logs detected events to stdout without writing any files. Combine it with --tag-thresholds Laughter=0.4 to tune sensitivity interactively.
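To illustrate the LABEL=VALUE form used by --tag-thresholds, here is a minimal parsing sketch. It is not paracap's actual parser: parse_tag_thresholds is a hypothetical function, and it assumes each item arrives as a separate string, because AudioSet label names can themselves contain commas.

```python
from typing import Dict, Iterable


def parse_tag_thresholds(items: Iterable[str]) -> Dict[str, float]:
    """Parse strings such as 'Laughter=0.4' into {'Laughter': 0.4}."""
    thresholds = {}
    for item in items:
        label, sep, value = item.partition("=")
        label = label.strip()
        if not sep or not label:
            raise ValueError(f"expected LABEL=VALUE, got {item!r}")
        threshold = float(value)  # raises ValueError on non-numeric input
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold for {label!r} must be in [0, 1]")
        thresholds[label] = threshold
    return thresholds
```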

See paracap --help for full documentation.
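The --hysteresis, --min-dur-seconds, and --merge-gap flags suggest a post-processing stage that turns frame-level probabilities into discrete events. The sketch below shows one common scheme, dual-threshold hysteresis followed by gap merging and a minimum-duration filter. It is an assumption about the general technique, not paracap's implementation: frames_to_events is a hypothetical name, and hysteresis is taken here to be how far the exit threshold sits below the entry threshold.

```python
def frames_to_events(probs, frame_seconds, threshold=0.5, hysteresis=0.1,
                     min_dur_seconds=0.2, merge_gap=0.3):
    """Turn per-frame probabilities into a list of (start, end, peak) events."""
    events = []
    start = None
    peak = 0.0
    for i, p in enumerate(probs):
        if start is None:
            if p >= threshold:            # enter on the high threshold
                start, peak = i, p
        elif p < threshold - hysteresis:  # leave only below the low threshold
            events.append((start * frame_seconds, i * frame_seconds, peak))
            start = None
        else:
            peak = max(peak, p)
    if start is not None:                 # event still open at the end of the audio
        events.append((start * frame_seconds, len(probs) * frame_seconds, peak))

    # Merge events separated by a short gap, then drop events that are too short.
    merged = []
    for s, e, pk in events:
        if merged and s - merged[-1][1] <= merge_gap:
            prev_s, _, prev_pk = merged[-1]
            merged[-1] = (prev_s, e, max(prev_pk, pk))
        else:
            merged.append((s, e, pk))
    return [ev for ev in merged if ev[1] - ev[0] >= min_dur_seconds]
```

Merging before filtering means that several short bursts close together can survive as one event; applying the two steps in the opposite order would discard them.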

Development

Run the test suite:

pytest -q

Architecture Overview

graph LR
    A[Audio Input] -->|ffmpeg/soundfile| B[Pre-processing]
    B --> C["ASR (WhisperX)"]
    B --> D["Tagging (PANNs)"]
    B --> E["Diarization (pyannote)"]
    C --> F[Word timestamps]
    D --> G[Events]
    E --> H[Speaker Segments]
    F --> I[Alignment]
    G --> I
    H --> I
    I --> J[Renderers]
    J --> K[txt]
    J --> L[tags.txt]
    J --> M[VTT/SRT]
    J --> N[JSON]
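The Alignment and Renderers stages in the diagram can be pictured with a small sketch that places event tags between timestamped words. Everything here is an assumption made for illustration: render_inline is a hypothetical function, the tuple shapes are invented, and the snapping rule (an event that starts within snap_tolerance seconds after a word's start is tagged before that word) is only one plausible reading of --snap-tolerance.

```python
from bisect import bisect_left


def render_inline(words, events, snap_tolerance=0.25):
    """Render words with inline [Label] tags.

    words:  list of (text, start, end) tuples, sorted by start time.
    events: list of (label, start, end) tuples.
    """
    starts = [start for _, start, _ in words]
    tags_at = {}  # word index -> tags to emit just before that word
    for label, ev_start, _ in sorted(events, key=lambda e: e[1]):
        # First word whose start is not earlier than the snapped event start.
        idx = bisect_left(starts, ev_start - snap_tolerance)
        tags_at.setdefault(idx, []).append(f"[{label}]")

    tokens = []
    for i, (text, _, _) in enumerate(words):
        tokens.extend(tags_at.get(i, []))
        tokens.append(text)
    tokens.extend(tags_at.get(len(words), []))  # events after the last word
    return " ".join(tokens)
```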

Examples

The examples/ directory contains sample audio files and expected outputs to verify the pipeline is working correctly.

Audio Samples

| File | Description | Source | Duration |
| --- | --- | --- | --- |
| examples/audio/laugh_clip.mp3 | Laughter starting at time 0 | Extracted | 3s |
| examples/audio/applause.wav | Crowd applause | YouTube | 5s |
| examples/audio/laughter_audioset.wav | Baby laughter | AudioSet | 10s |
| examples/audio/cough_audioset.wav | Cough/throat clearing | AudioSet | 10s |

Running the Examples

Detect events only (no transcription):

# Detect laughter (demonstrates edge detection - starts at 00:00:00.000)
paracap debug-events examples/audio/laugh_clip.mp3

# Expected output:
# Detected 1 events in examples/audio/laugh_clip.mp3:
# Laughter  00:00:00.000–00:00:01.595  score=0.480

# Detect applause
paracap debug-events examples/audio/applause.wav

# Expected output:
# Detected 1 events in examples/audio/applause.wav:
# Applause  00:00:01.277–00:00:05.000  score=0.571

# Detect cough from AudioSet (demonstrates edge detection - starts at 00:00:00.000)
paracap debug-events examples/audio/cough_audioset.wav

# Expected output:
# Detected 1 events in examples/audio/cough_audioset.wav:
# Cough  00:00:00.000–00:00:03.836  score=0.458

# Detect multiple laughter events from AudioSet
paracap debug-events examples/audio/laughter_audioset.wav

# Expected output:
# Detected 2 events in examples/audio/laughter_audioset.wav:
# Laughter  00:00:00.320–00:00:02.238  score=0.652
# Laughter  00:00:03.516–00:00:05.115  score=0.763
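To compare a run against the expected output programmatically, the event lines shown above can be parsed into tuples. This sketch assumes the stdout format is exactly as printed in these examples; parse_event_line and to_seconds are hypothetical helpers, not part of paracap.

```python
import re

# Matches lines such as: Laughter  00:00:00.320–00:00:02.238  score=0.652
EVENT_LINE = re.compile(
    r"^(?P<label>.+?)\s+"
    r"(?P<start>\d{2}:\d{2}:\d{2}\.\d{3})[–-]"
    r"(?P<end>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"score=(?P<score>\d+(?:\.\d+)?)$"
)


def to_seconds(stamp: str) -> float:
    """Convert HH:MM:SS.mmm to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_event_line(line: str):
    """Return (label, start, end, score), or None for non-event lines."""
    # Tolerate the leading '#' used when the output is quoted as a comment.
    match = EVENT_LINE.match(line.strip().lstrip("#").strip())
    if match is None:
        return None
    return (match["label"], to_seconds(match["start"]),
            to_seconds(match["end"]), float(match["score"]))
```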

Full transcription pipeline:

# Run full pipeline with ASR + event tagging
paracap transcribe examples/audio/laugh_clip.mp3 --no-diarize --formats json,tags.txt

# Output files:
# - laugh_clip.json (full transcript with events)
# - laugh_clip.tags.txt (inline transcript with [Laughter] tags)

Expected Outputs

Reference outputs are in examples/expected_output/ for validation.

Evaluation

Evaluate detection performance on the AudioSet eval subset:

# Evaluate on 100 samples
python scripts/evaluate_audioset.py --num-samples 100

# Evaluate on 300 samples and save results
python scripts/evaluate_audioset.py --num-samples 300 --output results.json

# Evaluate specific labels with failure analysis
python scripts/evaluate_audioset.py --num-samples 200 --labels Laughter,Cough --analyze-failures

The script downloads samples from AudioSet that match configured event labels and computes precision, recall, and F1 scores.
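For reference, the three metrics follow from true-positive, false-positive, and false-negative counts as in the sketch below. This shows the standard definitions only; precision_recall_f1 is a hypothetical function, and how scripts/evaluate_audioset.py counts matches is not specified here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

As a sanity check, precision 1.000 with recall 0.667 gives F1 = 0.800, which matches the Cough row in the benchmark table.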

Benchmark Results (AudioSet eval, 233 samples)

| Label | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Applause | 1.000 | 0.529 | 0.692 |
| Cough | 1.000 | 0.667 | 0.800 |
| Laughter | 0.965 | 0.573 | 0.719 |
| Sigh | 0.944 | 0.370 | 0.531 |
| Overall | 0.977 | 0.540 | 0.696 |

Thresholds were tuned to maximize F1. Gasp is excluded because PANNs detects it unreliably (F1 = 0.12).

License

MIT
