This tool uses OpenAI's GPT-4o transcription API to convert audio/video files to text with automatic speaker identification and labeling.
- Automatic Video Processing: Extracts audio from video files automatically
- Speaker Diarization: Identifies and labels different speakers (Speaker A, B, C, etc.)
- Smart Formatting: Groups consecutive segments by speaker with timestamps
- Combined Output: Merges split audio files into a single transcript with continuous timestamps
- Automatic Splitting: Handles files >25MB by automatically splitting and recombining
- Multi-speaker Support: Works with conversations, meetings, interviews, etc.
System Requirements:
- Python 3.11+
- ffmpeg (for audio extraction and splitting)
- OpenAI API key with access to GPT-4o models
Install ffmpeg:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
Set up the Python environment:
# Create virtual environment
python3 -m venv .
# Activate virtual environment
source bin/activate
# Install Python dependencies
pip install -r requirements.txt
Set the following environment variables:
ASSET_INPUT_DIR=/path/to/input/audio
ASSET_OUTPUT_DIR=/path/to/output/
OPENAI_API_KEY=your-api-key-here
# Optional: Custom video input directory (defaults to assets/video/)
VIDEO_INPUT_DIR=/path/to/video/input
# Optional: Test mode - stop after extraction without transcribing
TEST_MODE=false
Run the script:
python3 main.py
# Or run in test mode (extract audio only, no API calls)
TEST_MODE=true python3 main.py
Typical workflow:
- Place video files in the assets/video/ directory (.mp4, .webm, .avi, .mov)
- Run the script. It will automatically:
  - Archive existing WAV files to assets/audio/_old/
  - Extract audio from videos to assets/audio/ (mono, 16kHz)
  - Transcribe all audio files with speaker diarization
  - Move successfully transcribed videos to assets/video/_old/
- Find transcripts in the output directory as OUTPUT-{filename}.txt
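The file discovery and naming steps in this workflow can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the helper names `find_videos`, `transcript_path`, and `is_test_mode` are hypothetical, and the sketch assumes `{filename}` means the audio file name without its extension.

```python
import os
from collections.abc import Mapping
from pathlib import Path

# Extensions listed in the workflow above.
VIDEO_EXTENSIONS = {".mp4", ".webm", ".avi", ".mov"}


def find_videos(video_dir: Path) -> list[Path]:
    """Return supported video files directly inside video_dir, sorted by path.

    Subdirectories such as _old/ are skipped because only regular files match.
    """
    return sorted(
        p for p in video_dir.iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )


def transcript_path(output_dir: Path, audio_file: Path) -> Path:
    """Build the OUTPUT-{filename}.txt path for an audio file.

    Assumption: {filename} is the file name without its extension.
    """
    return output_dir / f"OUTPUT-{audio_file.stem}.txt"


def is_test_mode(env: Mapping[str, str] = os.environ) -> bool:
    """Treat TEST_MODE=true (case-insensitive) as enabled; anything else is off."""
    return env.get("TEST_MODE", "false").strip().lower() == "true"
```

For example, `transcript_path(Path("out"), Path("meeting.wav"))` yields a file named `OUTPUT-meeting.txt` inside `out`.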
Transcripts are formatted with speaker labels and timestamps for easy readability:
[00:00:15] Speaker A:
Hello everyone, thanks for joining today's meeting.
I wanted to discuss our progress on the new features.
We've made some great improvements this week.
[00:00:28] Speaker B:
That's great to hear.
Can you tell us more about the specific changes?
[00:00:35] Speaker A:
Absolutely, let me walk through them.
Format Details:
- Each speaker block starts with [HH:MM:SS] Speaker X:
- Consecutive segments from the same speaker are grouped together
- Text lines within a speaker block have no prefix
- Blank lines separate different speakers
- Output files are named OUTPUT-{filename}.txt
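The formatting rules above can be illustrated with a small sketch. It is not taken from the tool's source: the `Segment` class and both function names are hypothetical, and it assumes each diarized segment carries a start time in seconds, a speaker label, and its text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # start time in seconds
    speaker: str   # generic label such as "A" or "B"
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as [HH:MM:SS], truncating fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def format_transcript(segments: list[Segment]) -> str:
    """Group consecutive segments by speaker into blank-line-separated blocks."""
    blocks: list[list[str]] = []
    previous_speaker = None
    for seg in segments:
        if seg.speaker != previous_speaker:
            # A speaker change opens a new block with a timestamped header.
            blocks.append([f"{format_timestamp(seg.start)} Speaker {seg.speaker}:"])
            previous_speaker = seg.speaker
        # Text lines inside a block carry no prefix.
        blocks[-1].append(seg.text.strip())
    return "\n\n".join("\n".join(block) for block in blocks) + "\n"
```

Passing two consecutive segments from speaker A followed by one from speaker B produces two blocks, each headed by the timestamp of its first segment.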
- Phase 1 - Archive Audio: Moves existing WAV files to assets/audio/_old/
- Phase 2 - Extract Video: Extracts audio from video files using ffmpeg (mono, 16kHz)
- Phase 3 - Transcribe: Processes all WAV files with OpenAI's diarization API
- Phase 4 - Archive Video: Moves successfully transcribed videos to assets/video/_old/
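As a rough illustration of the extraction phase, the sketch below builds and runs an ffmpeg command with the mono, 16kHz settings mentioned above. The function names are hypothetical and the tool's real implementation may differ; the ffmpeg flags (`-ac 1 -ar 16000 -vn`) are the same ones this README uses for manual extraction.

```python
import subprocess
from pathlib import Path


def build_extract_command(video: Path, audio_dir: Path) -> list[str]:
    """Build an ffmpeg command that writes a mono, 16kHz WAV next to other audio."""
    wav = audio_dir / f"{video.stem}.wav"
    return [
        "ffmpeg",
        "-i", str(video),
        "-ac", "1",       # one audio channel (mono)
        "-ar", "16000",   # 16kHz sample rate
        "-vn",            # drop the video stream
        str(wav),
    ]


def extract_audio(video: Path, audio_dir: Path) -> Path:
    """Run ffmpeg and return the WAV path; raises CalledProcessError on failure."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_extract_command(video, audio_dir)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Keeping command construction separate from execution makes the command easy to inspect or log without invoking ffmpeg.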
- Files larger than 25MB are automatically split into chunks using ffmpeg
- Each chunk is transcribed with speaker diarization
- All chunks are combined into a single output file with continuous timestamps
- Timestamps are automatically adjusted across file boundaries
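One way to picture the timestamp adjustment is the sketch below, which shifts each chunk's segment times by the total duration of the chunks before it. This is an assumption about the approach, not the tool's confirmed implementation; `merge_chunks` and its tuple layout are hypothetical, and the sketch does not attempt to reconcile speaker labels between chunks.

```python
# A segment is (start_seconds, speaker_label, text), with start relative to its chunk.
Segment = tuple[float, str, str]


def merge_chunks(chunks: list[tuple[float, list[Segment]]]) -> list[Segment]:
    """Combine per-chunk segments into one list with continuous timestamps.

    Each item in chunks is (chunk_duration_seconds, segments_for_that_chunk),
    given in playback order.
    """
    merged: list[Segment] = []
    offset = 0.0
    for duration, segments in chunks:
        for start, speaker, text in segments:
            # Shift chunk-relative times onto the timeline of the original file.
            merged.append((start + offset, speaker, text))
        offset += duration
    return merged
```

With a 600-second first chunk, a segment that starts 5 seconds into the second chunk ends up at 605 seconds in the combined transcript.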
- Uses OpenAI's gpt-4o-transcribe-diarize model
- Automatically detects speaker changes
- Labels speakers as A, B, C, etc. (generic labels)
- Maintains speaker consistency across split file chunks
Videos are only archived after successful transcription. If transcription fails, the video remains in place for retry on the next run.
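The archive-on-success behavior might look something like the following sketch. The function `process_video` and its `transcribe` callback are hypothetical stand-ins for whatever the tool actually does; the point is only that the move happens after transcription returns without error.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_video(video: Path, transcribe: Callable[[Path], None], archive_dir: Path) -> bool:
    """Transcribe a video and archive it only if transcription succeeds.

    Returns True when the video was archived, False when it was left in place.
    """
    try:
        transcribe(video)
    except Exception as exc:
        # Leave the file where it is so the next run can retry it.
        print(f"Transcription failed for {video.name}: {exc}")
        return False
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(video), str(archive_dir / video.name))
    return True
```

A failed run therefore needs no cleanup: rerunning the script picks up the same video again.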
This tool uses OpenAI's gpt-4o-transcribe-diarize model, which may have different pricing than the standard Whisper model. Check OpenAI's pricing page for current rates.
Speakers are automatically labeled with generic identifiers (A, B, C, etc.). The model does not identify speakers by name unless you provide reference audio samples (not currently implemented in this tool).
If you prefer to extract audio manually or need custom settings:
ffmpeg -i meeting.webm -ac 1 -ar 16000 -vn meeting.wav
- -ac 1 sets the audio channels to 1 (mono)
- -ar 16000 sets the audio rate to 16kHz
- -vn tells ffmpeg to skip including video
Place the resulting WAV file directly in the ASSET_INPUT_DIR directory.
If needed, you can manually split audio files:
ffmpeg -i meeting.wav -f segment -segment_time 30 -c copy output%03d.wav
This creates multiple 30-second files (output000.wav, output001.wav, etc.) from your original audio.