Paracap

Paracap is a paralinguistic captioning toolkit that produces transcripts enriched with inline paralinguistic tags, along with caption files aligned to the audio. The package exposes both a Python API and a paracap command-line interface.

Features

Whisper/WhisperX automatic speech recognition integration
Optional pyannote diarization with graceful fallback
AudioSet-powered paralinguistic event tagging
Alignment of events to word timestamps and inline tag rendering
Writers for plain text, inline-tag text, VTT, SRT, and JSON summaries
Extensible configuration powered by Pydantic models

Installation

pip install paracap

For GPU support with CUDA 12.8, install torch with the appropriate index:

pip install torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

Optional dependencies can be installed via extras:

pip install paracap[asr]
pip install paracap[diarization]
pip install paracap[tagging]

During development, install the package in editable mode, then install the dev dependencies:

pip install -e .[all]
pip install -r requirements.txt

Troubleshooting

cuDNN version conflicts: The CLI automatically preloads the nvidia-cudnn libraries to resolve version conflicts between ctranslate2 (requires cuDNN 9.1.x) and torch 2.8+ (ships with cuDNN 9.10.x). If you encounter cuDNN loading errors when using the Python API directly, either:

Import paracap.cli first to trigger the preload, or

Set LD_LIBRARY_PATH to include the nvidia cudnn lib path:

export LD_LIBRARY_PATH=$(python -c "import nvidia.cudnn; print(nvidia.cudnn.__path__[0])")/lib:$LD_LIBRARY_PATH

Quickstart

paracap transcribe path/to/audio.wav --whisper-size small --formats json,vtt,tags.txt

Python API example:

from paracap.pipeline import TranscriptionPipeline
from paracap.config import PipelineConfig

config = PipelineConfig()
TranscriptionPipeline.from_config(config).process_file("audio.wav")

Lightweight Event Detection

For filtering large audio datasets without full transcription:

from paracap import EventDetector

detector = EventDetector()

# Detect events in an audio array sampled at 24 kHz
# (audio_24k is assumed to have been loaded beforehand)
result = detector.detect(audio_24k, sample_rate=24000)

# Or detect directly from file
result = detector.detect_file("audio.wav")

# Check for specific events
if result.has_events("Laughter"):
    print(f"Found {len(result.get_events('Laughter'))} laughter events")

# Access debug info
print(f"Max laughter probability: {result.max_probability('Laughter'):.3f}")
print(f"Raw probabilities shape: {result.probabilities['Laughter'].shape}")
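The dataset-filtering use case mentioned above can be sketched as a small helper. This is a minimal sketch, not part of Paracap: `files_with_event` is a hypothetical name, and it relies only on the `detect_file` and `has_events` calls shown in the example above. The detector is passed in as an argument so the helper itself has no dependency on the package.

```python
from typing import Iterable, List


def files_with_event(detector, paths: Iterable[str], label: str = "Laughter") -> List[str]:
    """Return the paths whose detection result contains at least one `label` event.

    `detector` is any object exposing detect_file(path) -> result, where the
    result exposes has_events(label), as in the EventDetector example above.
    """
    kept = []
    for path in paths:
        result = detector.detect_file(str(path))
        # Keep the file only if the requested label was detected at least once.
        if result.has_events(label):
            kept.append(str(path))
    return kept
```

With the package installed, this would presumably be called as `files_with_event(EventDetector(), ["a.wav", "b.wav"], "Laughter")`.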

The transcription pipeline writes its outputs next to the input file by default:

audio.txt
audio.tags.txt
audio.vtt
audio.srt
audio.json
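The VTT and SRT files differ mainly in their cue syntax: WebVTT timestamps use a period before the milliseconds, while SRT uses a comma and numbers each cue. The following is a minimal sketch of that formatting, not Paracap's actual writer; `format_timestamp`, `vtt_cue`, and `srt_cue` are hypothetical names.

```python
def format_timestamp(seconds: float, decimal_sep: str = ".") -> str:
    """Format seconds as HH:MM:SS<sep>mmm ('.' for WebVTT, ',' for SRT)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_sep}{ms:03d}"


def vtt_cue(start: float, end: float, text: str) -> str:
    """Render one WebVTT cue (the file itself must begin with a 'WEBVTT' line)."""
    return f"{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Render one SRT cue; SRT cues are numbered starting from 1."""
    return (
        f"{index}\n"
        f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}\n"
        f"{text}"
    )


print(vtt_cue(0.0, 1.595, "[Laughter]"))
```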

Configuration

Configuration is defined with Pydantic models and can be overridden via CLI options or YAML configuration files. Event tagging always exposes the complete label set provided by the underlying PANNs model, so no manual tag selection or mapping is necessary.

Key CLI flags:

--out-dir
--whisper-size
--language
--tag-thresholds (keys must match PANNs labels such as Laughter or Speech)
--hysteresis
--min-dur-seconds
--merge-gap
--snap-tolerance
--formats
--num-workers
--device
--hf-token-env
--save-intermediate
--verbose

paracap debug-events AUDIO runs only the event tagger and logs detected events to stdout (no files written). Combine with --tag-thresholds Laughter=0.4 to tune sensitivity interactively.

See paracap --help for full documentation.
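The names of the --hysteresis, --min-dur-seconds, and --merge-gap flags suggest a common post-processing chain for frame-level probabilities. The sketch below shows one plausible reading of those three parameters. It is an assumption about the technique, not Paracap's implementation: `detect_segments` and its argument names are hypothetical, and the real semantics should be checked against paracap --help.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def detect_segments(
    probs: Sequence[float],
    frame_seconds: float,
    threshold: float,
    hysteresis: float = 0.1,
    min_dur_seconds: float = 0.0,
    merge_gap: float = 0.0,
) -> List[Segment]:
    """Turn per-frame probabilities into event segments.

    An event opens when the probability reaches `threshold` and closes only
    when it drops below `threshold - hysteresis`, which avoids flickering
    around a single cutoff.
    """
    on, off = threshold, threshold - hysteresis
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= on:
            start = i
        elif start is not None and p < off:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:  # event still open at the end of the audio
        segments.append((start * frame_seconds, len(probs) * frame_seconds))

    # Merge segments separated by a gap of at most `merge_gap` seconds.
    merged: List[Segment] = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Drop segments shorter than the minimum duration.
    return [s for s in merged if s[1] - s[0] >= min_dur_seconds]
```

Merging happens before the duration filter here, so two short bursts close together can survive as one event. The opposite order is equally defensible and would discard them.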

Development

Run the test suite:

pytest -q

Architecture Overview

graph LR
    A[Audio Input] -->|ffmpeg/soundfile| B[Pre-processing]
    B --> C["ASR (WhisperX)"]
    B --> D["Tagging (PANNs)"]
    B --> E["Diarization (pyannote)"]
    C --> F[Word timestamps]
    D --> G[Events]
    E --> H[Speaker Segments]
    F --> I[Alignment]
    G --> I
    H --> I
    I --> J[Renderers]
    J --> K[txt]
    J --> L[tags.txt]
    J --> M[VTT/SRT]
    J --> N[JSON]
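The Alignment and Renderers stages in the graph above can be illustrated with a small sketch that places each event as an inline tag among timestamped words. This is not Paracap's algorithm: `render_inline` is a hypothetical name, and the handling of `snap_tolerance` (an event is placed before the first word that starts no earlier than `snap_tolerance` seconds before the event) is one assumed interpretation of the --snap-tolerance flag.

```python
from typing import Dict, List, Sequence, Tuple

Word = Tuple[str, float, float]   # (text, start_seconds, end_seconds)
Event = Tuple[str, float, float]  # (label, start_seconds, end_seconds)


def render_inline(
    words: Sequence[Word],
    events: Sequence[Event],
    snap_tolerance: float = 0.2,
) -> str:
    """Render words with [Label] tags inserted at word boundaries."""
    # Map a word index to the tags placed before that word;
    # index len(words) means "after the last word".
    tags_at: Dict[int, List[str]] = {}
    for label, ev_start, _ev_end in sorted(events, key=lambda e: e[1]):
        idx = len(words)
        for i, (_text, w_start, _w_end) in enumerate(words):
            if w_start >= ev_start - snap_tolerance:
                idx = i
                break
        tags_at.setdefault(idx, []).append(f"[{label}]")

    out: List[str] = []
    for i, (text, _s, _e) in enumerate(words):
        out.extend(tags_at.get(i, []))
        out.append(text)
    out.extend(tags_at.get(len(words), []))
    return " ".join(out)


words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("friend", 1.5, 1.9)]
print(render_inline(words, [("Laughter", 1.0, 1.4)]))
# hello there [Laughter] friend
```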

Examples

The examples/ directory contains sample audio files and expected outputs to verify the pipeline is working correctly.

Audio Samples

| File | Description | Source | Duration |
| --- | --- | --- | --- |
| `examples/audio/laugh_clip.mp3` | Laughter starting at time 0 | Extracted | 3s |
| `examples/audio/applause.wav` | Crowd applause | YouTube | 5s |
| `examples/audio/laughter_audioset.wav` | Baby laughter | AudioSet | 10s |
| `examples/audio/cough_audioset.wav` | Cough/throat clearing | AudioSet | 10s |

Running the Examples

Detect events only (no transcription):

# Detect laughter (the event starts at 00:00:00.000, exercising detection at the clip boundary)
paracap debug-events examples/audio/laugh_clip.mp3

# Expected output:
# Detected 1 events in examples/audio/laugh_clip.mp3:
# Laughter  00:00:00.000–00:00:01.595  score=0.480

# Detect applause
paracap debug-events examples/audio/applause.wav

# Expected output:
# Detected 1 events in examples/audio/applause.wav:
# Applause  00:00:01.277–00:00:05.000  score=0.571

# Detect cough from AudioSet (the event starts at 00:00:00.000, exercising detection at the clip boundary)
paracap debug-events examples/audio/cough_audioset.wav

# Expected output:
# Detected 1 events in examples/audio/cough_audioset.wav:
# Cough  00:00:00.000–00:00:03.836  score=0.458

# Detect multiple laughter events from AudioSet
paracap debug-events examples/audio/laughter_audioset.wav

# Expected output:
# Detected 2 events in examples/audio/laughter_audioset.wav:
# Laughter  00:00:00.320–00:00:02.238  score=0.652
# Laughter  00:00:03.516–00:00:05.115  score=0.763

Full transcription pipeline:

# Run full pipeline with ASR + event tagging
paracap transcribe examples/audio/laugh_clip.mp3 --no-diarize --formats json,tags.txt

# Output files:
# - laugh_clip.json (full transcript with events)
# - laugh_clip.tags.txt (inline transcript with [Laughter] tags)

Expected Outputs

Reference outputs are in examples/expected_output/ for validation.

Evaluation

Evaluate detection performance on the AudioSet eval subset:

# Evaluate on 100 samples
python scripts/evaluate_audioset.py --num-samples 100

# Evaluate on 300 samples and save results
python scripts/evaluate_audioset.py --num-samples 300 --output results.json

# Evaluate specific labels with failure analysis
python scripts/evaluate_audioset.py --num-samples 200 --labels Laughter,Cough --analyze-failures

The script downloads samples from AudioSet that match configured event labels and computes precision, recall, and F1 scores.
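For reference, the three metrics are standard functions of the true-positive, false-positive, and false-negative counts. The sketch below uses a hypothetical helper name and illustrative counts; it does not reproduce scripts/evaluate_audioset.py.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from detection counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts only, not taken from the benchmark.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, the rows in the benchmark table can be checked by hand: for Applause, 2 × 1.000 × 0.529 / (1.000 + 0.529) ≈ 0.692.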

Benchmark Results (AudioSet eval, 233 samples)

| Label | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Applause | 1.000 | 0.529 | 0.692 |
| Cough | 1.000 | 0.667 | 0.800 |
| Laughter | 0.965 | 0.573 | 0.719 |
| Sigh | 0.944 | 0.370 | 0.531 |
| Overall | 0.977 | 0.540 | 0.696 |

Thresholds were tuned to maximize F1. Gasp is excluded because PANNs detects it unreliably (F1=0.12).

License

MIT
