A powerful, local speech-to-text transcription system that combines OpenAI's Whisper for accurate transcription with pyannote.audio for speaker diarization (identifying who spoke when). Perfect for meetings, interviews, podcasts, and any audio/video content that needs accurate transcription with speaker identification.
- High-Quality Transcription: Uses OpenAI's Whisper models (tiny to large) for accurate speech recognition
- Speaker Diarization: Identifies different speakers by voice patterns using pyannote.audio
- Video Support: Extract audio from video files and run complete video-to-text pipelines
- GPU Acceleration: Optimized for CUDA-enabled GPUs (RTX series), typically 5-10x faster than CPU processing
- Multiple Output Formats: JSON, TXT, SRT, VTT for different use cases
- Batch Processing: Process multiple files at once for efficiency
- Multi-language Support: Auto-detects language with excellent support for English and other languages
- Interactive Workflow: User-friendly guided workflow for beginners
- Flexible Audio Formats: Support for MP3, WAV, M4A, FLAC, OGG, WMA input/output
- Python 3.8 or higher
- FFmpeg (for video processing)
- CUDA-compatible GPU (optional, for acceleration)
- HuggingFace account and token (for speaker diarization)
git clone <your-repo-url>
cd transcriptor
pip install -r requirements.txt
Note: If you encounter issues with PyTorch, install it separately:
# For CUDA support (recommended)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# For CPU only
pip install torch torchaudio
Windows:
- Download from ffmpeg.org
- Add to PATH or place in project directory
macOS:
brew install ffmpeg
Linux:
sudo apt update && sudo apt install ffmpeg
- Go to HuggingFace Settings
- Create a new token
- Accept terms at pyannote/speaker-diarization
- Set environment variable using one of these methods:
Method 1: Set environment variable directly
# Windows (Command Prompt)
set HF_TOKEN=your_token_here
# Windows (PowerShell)
$env:HF_TOKEN="your_token_here"
# Linux/Mac
export HF_TOKEN=your_token_here
Method 2: Create .env file (Recommended)
# Create .env file in project root
echo HF_TOKEN=your_token_here > .env
# Or manually create .env file with:
# HF_TOKEN=your_token_here
Note: The .env file method is recommended as it persists across terminal sessions and is automatically loaded by the application.
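How the application loads the .env file is not shown in this README. As a rough illustration only, the sketch below reads KEY=VALUE pairs with the standard library; the function name `load_env_file` is hypothetical, and the real scripts may rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env", environ=os.environ):
    """Copy KEY=VALUE pairs from a .env-style file into `environ`.

    Hypothetical helper: variables that already exist are left untouched,
    so a value exported in the shell takes precedence over the file.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}

    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and optional quotes around the value.
        value = value.strip().strip('"').strip("'")
        if key and key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file()` at startup and then reading `os.environ.get("HF_TOKEN")` would be one way to pick up the token. Stripping whitespace also tolerates the trailing space that `echo ... > .env` can leave behind in the Windows Command Prompt.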
The easiest way to get started is using the interactive workflow:
python transcribe_workflow.py
This will guide you through the entire process with prompts, helping you choose:
- Input type (video or audio)
- Processing options
- Output preferences
# Basic transcription
python transcribe.py "path/to/audio.mp3"
# With speaker diarization
python transcribe.py "path/to/audio.mp3" --model small --device cuda
# Video to text (extracts audio first, then transcribes)
python video_to_text.py "path/to/video.mp4"
# Audio extraction only
python extract_audio.py "path/to/video.mp4" --format wav --quality high
The core transcription script with speaker diarization:
python transcribe.py "audio/meeting.wav" --model small --device cuda --output results/
Options:
- --model: Whisper model size (tiny, base, small, medium, large)
- --device: Device to use (cpu, cuda, auto)
- --output: Output directory (default: "output")
Model Selection Guide:
- tiny: Fastest, good for English (39M parameters)
- base: Good balance of speed/accuracy (74M parameters)
- small: Better accuracy, moderate speed (244M parameters)
- medium: High accuracy, slower (769M parameters)
- large: Best accuracy, slowest (1550M parameters)
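The options and model sizes above could be declared with `argparse` roughly as follows. This is a sketch, not the actual parser in transcribe.py: the flag names and the "output" default come from this README, while `build_parser` and the defaults for `--model` and `--device` are assumptions made for illustration.

```python
import argparse

# Model sizes and devices as listed in this README.
WHISPER_MODELS = ("tiny", "base", "small", "medium", "large")
DEVICES = ("cpu", "cuda", "auto")


def build_parser():
    """Build an illustrative command-line parser for a transcription script."""
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file (illustrative parser)."
    )
    parser.add_argument("audio", help="Path to the input audio file")
    parser.add_argument(
        "--model",
        choices=WHISPER_MODELS,
        default="small",  # assumed default, not confirmed by the README
        help="Whisper model size",
    )
    parser.add_argument(
        "--device",
        choices=DEVICES,
        default="auto",  # assumed default, not confirmed by the README
        help="Device to use",
    )
    parser.add_argument(
        "--output",
        default="output",  # default stated in the README
        help="Output directory",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.audio, args.model, args.device, args.output)
```

Using `choices` means a misspelled model name is rejected with a usage message before any model is loaded.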
Complete pipeline from video to transcribed text:
python video_to_text.py "video/presentation.mp4" --whisper-model small --device cuda
Options:
- --audio-format: Audio format (mp3, wav, m4a, flac, ogg)
- --audio-quality: Quality (low, medium, high)
- --whisper-model: Whisper model size
- --device: Device to use
- --keep-audio: Keep extracted audio file
Extract audio from video files:
python extract_audio.py "video.mp4" --format wav --quality high --output audio/
Options:
- --format: Output audio format
- --quality: Audio quality (affects bitrate/sample rate)
- --output: Output directory
- --batch: Process all videos in directory
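The internals of extract_audio.py are not shown here, and the project credits both MoviePy and FFmpeg. As one possible approach, the sketch below calls the `ffmpeg` binary directly to write a mono WAV file. The `SAMPLE_RATES` mapping, the function names, and the `<name>_audio.wav` output naming (inferred from the file names used in this README's examples) are all assumptions.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from the quality names to WAV sample rates (Hz).
SAMPLE_RATES = {"low": 16000, "medium": 22050, "high": 44100}


def build_ffmpeg_command(video_path, output_dir="audio", quality="high"):
    """Return the ffmpeg argument list for extracting mono 16-bit PCM audio."""
    video = Path(video_path)
    # Output naming is assumed, e.g. meeting.mp4 -> meeting_audio.wav.
    out_path = Path(output_dir) / f"{video.stem}_audio.wav"
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(video),          # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to one channel
        "-ar", str(SAMPLE_RATES[quality]),  # output sample rate
        "-c:a", "pcm_s16le",       # 16-bit PCM, the usual WAV encoding
        str(out_path),
    ]


def extract_audio(video_path, output_dir="audio", quality="high"):
    """Run ffmpeg and return the path of the extracted audio file."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_ffmpeg_command(video_path, output_dir, quality)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd[-1]
```

Keeping command construction separate from execution makes the arguments easy to inspect or log before any file is touched.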
User-friendly interface for all transcription tasks:
python transcribe_workflow.py
Features:
- Guided file selection
- Interactive option configuration
- Progress tracking
- Error handling and suggestions
The system generates multiple output formats for different use cases:
- JSON: Detailed transcription with timestamps, speaker info, and confidence scores
- TXT: Plain text transcription for easy reading
- SRT: Subtitle format with speaker labels for video players
- VTT: Web video subtitle format for web applications
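To illustrate how timestamped segments map onto the SRT format, here is a minimal sketch. It assumes each segment is a dict with `start`, `end`, `speaker`, and `text` keys, as in this README's JSON example; the function names are hypothetical and the project's own writer may differ.

```python
def format_srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render a list of segment dicts as SRT text with speaker labels."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        speaker = seg.get("speaker")
        # Prefix the speaker label when diarization provided one.
        text = f"{speaker}: {seg['text']}" if speaker else seg["text"]
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    example = [
        {
            "start": 0.0,
            "end": 2.5,
            "speaker": "Speaker 1",
            "text": "Hello, welcome to our meeting.",
        }
    ]
    print(segments_to_srt(example))
```

VTT output would follow the same shape, with a `WEBVTT` header line and a period instead of a comma in the timestamps.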
Example JSON Output:
{
"segments": [
{
"start": 0.0,
"end": 2.5,
"speaker": "Speaker 1",
"text": "Hello, welcome to our meeting.",
"confidence": 0.95
}
]
}
- Common: MP4, AVI, MOV, MKV, WMV, FLV
- Web: WebM, M4V
- Mobile: 3GP
- Lossy: MP3, M4A, OGG, WMA
- Lossless: WAV, FLAC
Whisper automatically detects the language. For best results:
- English: All models work excellently
- Other Languages: Use medium or large models for better accuracy
- Mixed Language: Large models handle code-switching well
- CUDA Users: Use --device cuda for 5-10x faster processing
- Memory Management: Close other GPU applications to avoid CUDA out of memory errors
- Model Selection: Balance between speed and accuracy based on your needs
- Short Files: Use tiny or base models for quick results
- Long Files: Use small or medium for better accuracy
- Batch Processing: Process multiple files overnight for efficiency
- HF_TOKEN: HuggingFace authentication token for speaker diarization
- CUDA_VISIBLE_DEVICES: Specify which GPU to use (if multiple)
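A short sketch of how these two variables might be handled from Python. Both helper names are hypothetical. `CUDA_VISIBLE_DEVICES` has to be set before CUDA is initialized, so this example passes it to a child process rather than changing it after the fact.

```python
import os


def require_hf_token(env=None):
    """Return the HuggingFace token or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; speaker diarization needs it.")
    return token


def child_env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # Inside the child process the selected GPU appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env
```

For example, `subprocess.run([sys.executable, "transcribe.py", "audio.mp3", "--device", "cuda"], env=child_env_for_gpu(1))` would run the transcription on the second GPU only.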
Edit the Python files to customize:
- Default model sizes
- Output directory structure
- Audio quality preferences
- Speaker diarization parameters
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
- Ensure you have a valid HuggingFace token
- Accept the pyannote.audio model terms
- Set HF_TOKEN environment variable correctly
- Check token permissions
- Use smaller Whisper model (tiny or base)
- Close other GPU applications
- Process shorter audio segments
- Use CPU if GPU memory is insufficient
- Install FFmpeg: conda install ffmpeg or download from ffmpeg.org
- Check video file integrity
- Ensure video has audio track
- Use larger models for better accuracy
- Ensure clear audio input
- Check audio format compatibility
- Consider audio preprocessing for noisy files
- Check the output files for detailed error information
- Ensure all dependencies are installed correctly
- Verify your HuggingFace token is valid
- Check system requirements (Python version, FFmpeg, etc.)
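One of the out-of-memory suggestions above is to process shorter audio segments. The sketch below shows one way to plan such segments as overlapping time windows. The function name and the default window and overlap lengths are arbitrary assumptions, and the project's scripts are not known to do this themselves.

```python
def chunk_ranges(duration, chunk_length=600.0, overlap=5.0):
    """Return (start, end) windows in seconds that cover [0, duration].

    Consecutive windows share `overlap` seconds so that a sentence cut at
    a boundary appears whole in at least one window.
    """
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be in [0, chunk_length)")

    ranges = []
    start = 0.0
    step = chunk_length - overlap
    while start < duration:
        end = min(start + chunk_length, duration)
        ranges.append((start, end))
        if end >= duration:
            break  # the last window already reaches the end of the audio
        start += step
    return ranges
```

Each window can then be cut out (for example with FFmpeg) and transcribed on its own, with the window's start time added back to the segment timestamps when the results are merged.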
- Extract audio from meeting video:
  python extract_audio.py "meeting.mp4" --format wav --quality high
- Transcribe with speaker identification:
  python transcribe.py "meeting_audio.wav" --model small --device cuda
- View results:
  - Check output/ folder for all formats
  - Open SRT file in video player for subtitles
  - Use JSON for detailed analysis
- Batch process multiple episodes:
  python extract_audio.py "podcasts/" --batch --format mp3 --quality high
- Transcribe all episodes:
  for file in audio/*.mp3; do
    python transcribe.py "$file" --model medium --device cuda
  done
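The shell loop above only works in a POSIX shell. A rough Python equivalent that also runs on Windows might look like the sketch below; the helper names are hypothetical, and the flags are the ones used in this README.

```python
import subprocess
import sys
from pathlib import Path


def build_batch_commands(audio_dir="audio", pattern="*.mp3",
                         model="medium", device="cuda"):
    """Return one transcribe.py command per matching file, in sorted order."""
    files = sorted(Path(audio_dir).glob(pattern))
    return [
        [sys.executable, "transcribe.py", str(path),
         "--model", model, "--device", device]
        for path in files
    ]


def run_batch(commands):
    """Run each command in turn and return the files that failed."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failures.append(cmd[2])  # cmd[2] is the audio file path
    return failures


if __name__ == "__main__":
    failed = run_batch(build_batch_commands())
    if failed:
        print("Failed:", ", ".join(failed))
```

Collecting failures instead of stopping at the first error lets an overnight run finish the remaining episodes.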
- Extract audio for editing:
  python extract_audio.py "content.mp4" --format wav --quality high
- Generate subtitles:
  python transcribe.py "content_audio.wav" --model small --device cuda
We welcome contributions! Here's how you can help:
- Report Issues: Use GitHub issues for bug reports and feature requests
- Submit PRs: Fork the repository and submit pull requests
- Improve Documentation: Help make the setup and usage clearer
- Add Features: Implement new output formats or processing options
git clone <your-fork-url>
cd transcriptor
pip install -r requirements.txt
pip install -e . # Install in development mode
This project is open source and available under the MIT License.
- OpenAI Whisper: For the excellent speech recognition models
- pyannote.audio: For speaker diarization capabilities
- MoviePy: For video processing and audio extraction
- FFmpeg: For multimedia processing
Made with ❤️ for easy, local transcription
Transform your audio and video content into searchable, accessible text with professional-grade accuracy.