Whisper + pyannote.audio Transcription System

A powerful, local speech-to-text transcription system that combines OpenAI's Whisper for accurate transcription with pyannote.audio for speaker diarization (identifying who spoke when). Perfect for meetings, interviews, podcasts, and any audio/video content that needs accurate transcription with speaker identification.

🚀 Features

High-Quality Transcription: Uses OpenAI's Whisper models (tiny to large) for accurate speech recognition
Speaker Diarization: Identifies different speakers by voice patterns using pyannote.audio
Video Support: Extract audio from video files and run complete video-to-text pipelines
GPU Acceleration: Optimized for CUDA-enabled GPUs (RTX series), typically 5-10x faster than CPU depending on model size and hardware
Multiple Output Formats: JSON, TXT, SRT, VTT for different use cases
Batch Processing: Process multiple files at once for efficiency
Multi-language Support: Auto-detects the spoken language; accuracy is highest for English, and larger models improve results for other languages
Interactive Workflow: User-friendly guided workflow for beginners
Flexible Audio Formats: Support for MP3, WAV, M4A, FLAC, OGG, WMA input/output

📋 Prerequisites

Python 3.8 or higher
FFmpeg (for video processing)
CUDA-compatible GPU (optional, for acceleration)
HuggingFace account and token (for speaker diarization)

🛠️ Installation

1. Clone the Repository

git clone <your-repo-url>
cd transcriptor

2. Install Dependencies

pip install -r requirements.txt

Note: If you encounter issues with PyTorch, install it separately:

# For CUDA support (recommended)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CPU only
pip install torch torchaudio

3. Install FFmpeg

Windows:

Download from ffmpeg.org
Add to PATH or place in project directory

macOS:

brew install ffmpeg

Linux:

sudo apt update && sudo apt install ffmpeg

4. Get HuggingFace Token (Required for Speaker Diarization)

Go to HuggingFace Settings
Create a new token
Accept the model's user conditions on the pyannote/speaker-diarization page (while logged in with the same account)

Set environment variable using one of these methods:

Method 1: Set environment variable directly

# Windows (Command Prompt)
set HF_TOKEN=your_token_here

# Windows (PowerShell)
$env:HF_TOKEN="your_token_here"

# Linux/Mac
export HF_TOKEN=your_token_here

Method 2: Create .env file (Recommended)

# Create .env file in project root
echo HF_TOKEN=your_token_here > .env

# Or manually create .env file with:
# HF_TOKEN=your_token_here

Note: The .env file method is recommended because it persists across terminal sessions and is loaded automatically by the application. Keep .env out of version control so the token is never committed.
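To illustrate what "automatically loaded" can mean in practice, here is a minimal standard-library sketch of reading a .env file. It is not the application's actual loader (which may well use a package such as python-dotenv); the function name load_env_file is hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration. Variables that are already set
    in the environment win, so a token exported in the shell is not
    overwritten by the file.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop stray whitespace and optional surrounding quotes.
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()  # does nothing if no .env file is present
print("HF_TOKEN is set" if os.environ.get("HF_TOKEN") else "HF_TOKEN is missing")
```

Running this from the project root is a quick way to confirm that the token is visible before starting a long transcription job.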

🚀 Quick Start

Recommended: Interactive Workflow

The easiest way to get started is using the interactive workflow:

python transcribe_workflow.py

This will guide you through the entire process with prompts, helping you choose:

Input type (video or audio)
Processing options
Output preferences

Alternative: Direct Commands

# Basic transcription
python transcribe.py "path/to/audio.mp3"

# With a specific model on the GPU (speaker diarization needs HF_TOKEN to be set)
python transcribe.py "path/to/audio.mp3" --model small --device cuda

# Video to text (extracts audio first, then transcribes)
python video_to_text.py "path/to/video.mp4"

# Audio extraction only
python extract_audio.py "path/to/video.mp4" --format wav --quality high

📖 Detailed Usage

1. Audio Transcription (`transcribe.py`)

The core transcription script with speaker diarization:

python transcribe.py "audio/meeting.wav" --model small --device cuda --output results/

Options:

--model: Whisper model size (tiny, base, small, medium, large)
--device: Device to use (cpu, cuda, auto)
--output: Output directory (default: "output")

Model Selection Guide:

tiny: Fastest, good for English (39M parameters)
base: Good balance of speed/accuracy (74M parameters)
small: Better accuracy, moderate speed (244M parameters)
medium: High accuracy, slower (769M parameters)
large: Best accuracy, slowest (1550M parameters)
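Whichever model is chosen, the transcribed segments still have to be matched to the speaker turns found by diarization. This README does not say how transcribe.py does that; one common approach, sketched here as an assumption, is to give each segment the speaker whose turns overlap it the longest. The function names and the dictionary layout are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Label each transcript segment with the speaker that overlaps it most.

    segments: dicts with "start", "end", "text" (transcription output)
    turns:    dicts with "start", "end", "speaker" (diarization output)
    """
    labelled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        # Fall back to a placeholder when no turn overlaps the segment.
        speaker = max(totals, key=totals.get) if totals else unknown
        labelled.append({**seg, "speaker": speaker})
    return labelled


segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello, welcome to our meeting."},
    {"start": 2.5, "end": 5.0, "text": "Thanks for having me."},
]
turns = [
    {"start": 0.0, "end": 2.8, "speaker": "Speaker 1"},
    {"start": 2.8, "end": 6.0, "speaker": "Speaker 2"},
]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], "-", seg["text"])
```

In this example the second segment overlaps both turns, but far more of it falls inside the second turn, so it is attributed to Speaker 2.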

2. Video Processing (`video_to_text.py`)

Complete pipeline from video to transcribed text:

python video_to_text.py "video/presentation.mp4" --whisper-model small --device cuda

Options:

--audio-format: Audio format (mp3, wav, m4a, flac, ogg)
--audio-quality: Quality (low, medium, high)
--whisper-model: Whisper model size
--device: Device to use
--keep-audio: Keep extracted audio file

3. Audio Extraction (`extract_audio.py`)

Extract audio from video files:

python extract_audio.py "video.mp4" --format wav --quality high --output audio/

Options:

--format: Output audio format
--quality: Audio quality (affects bitrate/sample rate)
--output: Output directory
--batch: Process all videos in directory
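For readers curious about what audio extraction involves, the sketch below builds a plain FFmpeg command that drops the video stream and writes 16 kHz mono WAV. It is an illustration only: the Acknowledgments credit MoviePy for extraction, so extract_audio.py likely works differently, and the function name and defaults here are hypothetical.

```python
from pathlib import Path


def build_ffmpeg_command(video_path, output_dir="audio", sample_rate=16000):
    """Return an FFmpeg argument list that extracts mono WAV audio.

    The command is only built, not executed, so it can be inspected first.
    """
    video = Path(video_path)
    target = Path(output_dir) / f"{video.stem}.wav"
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video),        # input video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, i.e. uncompressed WAV
        "-ar", str(sample_rate), # output sample rate in Hz
        "-ac", "1",              # downmix to mono
        str(target),
    ]


print(" ".join(build_ffmpeg_command("video/presentation.mp4")))
```

To run the command, pass the list to subprocess.run(cmd, check=True) after creating the output directory. Whisper resamples its input to 16 kHz mono internally, so extracting at that rate loses nothing for transcription, though a higher rate is preferable if the audio will also be edited.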

4. Interactive Workflow (`transcribe_workflow.py`)

User-friendly interface for all transcription tasks:

python transcribe_workflow.py

Features:

Guided file selection
Interactive option configuration
Progress tracking
Error handling and suggestions

📁 Output Formats

The system generates multiple output formats for different use cases:

JSON: Detailed transcription with timestamps, speaker info, and confidence scores
TXT: Plain text transcription for easy reading
SRT: Subtitle format with speaker labels for video players
VTT: Web video subtitle format for web applications

Example JSON Output:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "speaker": "Speaker 1",
      "text": "Hello, welcome to our meeting.",
      "confidence": 0.95
    }
  ]
}
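Because the JSON output carries timestamps and speaker labels, the other formats can be derived from it. As a rough sketch, the code below turns segments shaped like the example above into SRT text. The helper names and the "[Speaker 1]" prefix style are assumptions; the SRT files written by the tool itself may be laid out differently.

```python
import json


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text']}\n"
        )
    return "\n".join(blocks)


example = json.loads("""
{
  "segments": [
    {"start": 0.0, "end": 2.5, "speaker": "Speaker 1",
     "text": "Hello, welcome to our meeting.", "confidence": 0.95}
  ]
}
""")
print(segments_to_srt(example["segments"]))
```

The same loop can be adapted for VTT, which uses a "WEBVTT" header line and a period instead of a comma before the milliseconds.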

🎯 Supported Formats

Video Input

Common: MP4, AVI, MOV, MKV, WMV, FLV
Web: WebM, M4V
Mobile: 3GP

Audio Input/Output

Lossy: MP3, M4A, OGG, WMA
Lossless: WAV, FLAC

🌍 Language Support

Whisper automatically detects the language. For best results:

English: All models work excellently
Other Languages: Use medium or large models for better accuracy
Mixed Language: The large model copes best with code-switching, though results vary with the language pair

⚡ Performance Optimization

GPU Acceleration

CUDA Users: Use --device cuda for roughly 5-10x faster processing than on CPU
Memory Management: Close other GPU applications to avoid CUDA out of memory errors
Model Selection: Balance between speed and accuracy based on your needs
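The --device option also accepts auto. A plausible way to resolve it, shown here as a sketch rather than the project's actual logic, is to ask PyTorch whether a CUDA device is usable and fall back to the CPU otherwise. The function name resolve_device is hypothetical; torch.cuda.is_available() is the standard PyTorch check.

```python
def resolve_device(requested="auto"):
    """Map a --device value (cpu, cuda, auto) to the device actually used."""
    if requested not in {"cpu", "cuda", "auto"}:
        raise ValueError(f"Unknown device: {requested!r}")
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # imported lazily so the CPU path works without it
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if requested == "cuda" and not has_cuda:
        # Fail early instead of crashing later while loading the model.
        raise RuntimeError("CUDA was requested but no usable GPU was found")
    return "cuda" if has_cuda else "cpu"


print(resolve_device("auto"))
```

Raising an error for an explicit cuda request, rather than silently falling back, avoids an unexpectedly slow CPU run on a long recording.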

Processing Tips

Short Files: Use tiny or base models for quick results
Long Files: Use small or medium for better accuracy
Batch Processing: Process multiple files overnight for efficiency

🔧 Configuration

Environment Variables

HF_TOKEN: HuggingFace authentication token for speaker diarization
CUDA_VISIBLE_DEVICES: Index of the GPU to use when several are installed (for example, 0 for the first GPU)

Custom Settings

Edit the Python files to customize:

Default model sizes
Output directory structure
Audio quality preferences
Speaker diarization parameters

🚨 Troubleshooting

Common Issues

1. "No module named 'torch'"

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

2. Speaker diarization not working

Ensure you have a valid HuggingFace token
Accept the pyannote.audio model terms
Set HF_TOKEN environment variable correctly
Check token permissions

3. CUDA out of memory

Use smaller Whisper model (tiny or base)
Close other GPU applications
Process shorter audio segments
Use CPU if GPU memory is insufficient

4. Audio extraction fails

Install FFmpeg: conda install ffmpeg or download from ffmpeg.org
Check video file integrity
Ensure video has audio track

5. Poor transcription quality

Use larger models for better accuracy
Ensure clear audio input
Check audio format compatibility
Consider audio preprocessing for noisy files

Getting Help

Check the output files for detailed error information
Ensure all dependencies are installed correctly
Verify your HuggingFace token is valid
Check system requirements (Python version, FFmpeg, etc.)

📚 Example Workflows

Meeting Transcription

Extract audio from meeting video:

python extract_audio.py "meeting.mp4" --format wav --quality high

Transcribe with speaker identification:

python transcribe.py "meeting_audio.wav" --model small --device cuda

View results:
- Check output/ folder for all formats
- Open SRT file in video player for subtitles
- Use JSON for detailed analysis

Podcast Processing

Batch process multiple episodes:

python extract_audio.py "podcasts/" --batch --format mp3 --quality high

Transcribe all episodes:

for file in audio/*.mp3; do
  python transcribe.py "$file" --model medium --device cuda
done
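The loop above is written for a POSIX shell. On Windows, or when the batch is driven from another script, the same idea can be expressed in Python. This is a sketch assuming transcribe.py sits in the current directory; the function name build_batch_commands is hypothetical, and the commands are only built here, not executed.

```python
import sys
from pathlib import Path


def build_batch_commands(audio_dir="audio", pattern="*.mp3",
                         model="medium", device="cuda"):
    """Return one transcribe.py command per matching audio file."""
    commands = []
    # Sort so episodes are processed in a predictable order.
    for audio_file in sorted(Path(audio_dir).glob(pattern)):
        commands.append([
            sys.executable, "transcribe.py", str(audio_file),
            "--model", model, "--device", device,
        ])
    return commands


for command in build_batch_commands():
    print(" ".join(command))
```

To run the batch, pass each command to subprocess.run(command). Leaving check at its default of False lets the remaining episodes continue if one file fails.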

Video Content Creation

Extract audio for editing:

python extract_audio.py "content.mp4" --format wav --quality high

Generate subtitles:

python transcribe.py "content_audio.wav" --model small --device cuda

🤝 Contributing

We welcome contributions! Here's how you can help:

Report Issues: Use GitHub issues for bug reports and feature requests
Submit PRs: Fork the repository and submit pull requests
Improve Documentation: Help make the setup and usage clearer
Add Features: Implement new output formats or processing options

Development Setup

git clone <your-fork-url>
cd transcriptor
pip install -r requirements.txt
pip install -e .  # Install in development mode (needs a setup.py or pyproject.toml in the repository)

📄 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

OpenAI Whisper: For the excellent speech recognition models
pyannote.audio: For speaker diarization capabilities
MoviePy: For video processing and audio extraction
FFmpeg: For multimedia processing

Made with ❤️ for easy, local transcription

Transform your audio and video content into searchable, accessible text with professional-grade accuracy.
