Audio Transcription Tool with Speaker Diarization

This tool uses OpenAI's GPT-4o transcription API to convert audio/video files to text with automatic speaker identification and labeling.

Features

Automatic Video Processing: Extracts the audio track from video files before transcription
Speaker Diarization: Identifies and labels different speakers (Speaker A, B, C, etc.)
Smart Formatting: Groups consecutive segments by speaker with timestamps
Combined Output: Merges the transcripts of split audio files into a single transcript with continuous timestamps
Automatic Splitting: Splits files larger than 25MB into chunks and recombines the results
Multi-speaker Support: Works with conversations, meetings, interviews, etc.

Setup

1. Install Prerequisites

System Requirements:

Python 3.11+
ffmpeg (for audio extraction and splitting)
OpenAI API key with access to GPT-4o models

Install ffmpeg:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

2. Set up Python environment

# Create virtual environment
python3 -m venv .

# Activate virtual environment
source bin/activate

# Install Python dependencies
pip install -r requirements.txt

3. Set up environment variables in .env

ASSET_INPUT_DIR=/path/to/input/audio
ASSET_OUTPUT_DIR=/path/to/output/
OPENAI_API_KEY=your-api-key-here

# Optional: Custom video input directory (defaults to assets/video/)
VIDEO_INPUT_DIR=/path/to/video/input

# Optional: Test mode - stop after extraction without transcribing
TEST_MODE=false
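
For illustration, the variables above could be read in Python roughly as follows. This is a minimal sketch, not the code in main.py: the Settings class and load_settings function are hypothetical names, and the sketch assumes the values are already present in the process environment (how the .env file itself gets loaded is not covered here).

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Settings:
    """Hypothetical container for the documented environment variables."""
    input_dir: Path
    output_dir: Path
    video_dir: Path
    api_key: str
    test_mode: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        # Required variables raise KeyError when missing.
        input_dir=Path(env["ASSET_INPUT_DIR"]),
        output_dir=Path(env["ASSET_OUTPUT_DIR"]),
        # Falls back to the documented default when VIDEO_INPUT_DIR is unset.
        video_dir=Path(env.get("VIDEO_INPUT_DIR", "assets/video/")),
        api_key=env["OPENAI_API_KEY"],
        # Assumption: any casing of "true" enables test mode.
        test_mode=env.get("TEST_MODE", "false").strip().lower() == "true",
    )
```

Passing a plain dict instead of os.environ makes the function easy to exercise without touching the real environment.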

4. Run the transcription

python3 main.py

# Or run in test mode (extract audio only, no API calls)
TEST_MODE=true python3 main.py

Usage Workflow

1. Place video files in the assets/video/ directory (.mp4, .webm, .avi, .mov)
2. Run the script - it will automatically:
   - Archive existing WAV files to assets/audio/_old/
   - Extract audio from videos to assets/audio/ (mono, 16kHz)
   - Transcribe all audio files with speaker diarization
   - Move successfully transcribed videos to assets/video/_old/
3. Find transcripts in the output directory as OUTPUT-{filename}.txt

Output Format

Transcripts are formatted with speaker labels and timestamps for easy readability:

[00:00:15] Speaker A:
Hello everyone, thanks for joining today's meeting.
I wanted to discuss our progress on the new features.
We've made some great improvements this week.

[00:00:28] Speaker B:
That's great to hear.
Can you tell us more about the specific changes?

[00:00:35] Speaker A:
Absolutely, let me walk through them.

Format Details:

Each speaker block starts with [HH:MM:SS] Speaker X:
Consecutive segments from the same speaker are grouped together
Text lines within a speaker block have no prefix
Blank lines separate different speakers
Output files are named OUTPUT-{filename}.txt
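
The formatting rules above can be sketched in a few lines of Python. This is an illustration only: format_timestamp and format_transcript are hypothetical names, and the segment shape (dicts with 'speaker', 'start' in seconds, and 'text') is an assumption rather than the tool's actual data model.

```python
def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as [HH:MM:SS], dropping fractions."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def format_transcript(segments) -> str:
    """Group consecutive segments by speaker into timestamped blocks."""
    blocks = []
    current_speaker = None
    for seg in segments:
        if seg["speaker"] != current_speaker:
            # A speaker change opens a new block with its header line.
            current_speaker = seg["speaker"]
            header = f"{format_timestamp(seg['start'])} Speaker {current_speaker}:"
            blocks.append([header])
        # Text lines inside a block carry no prefix.
        blocks[-1].append(seg["text"].strip())
    # Blank lines separate blocks from different speakers.
    return "\n\n".join("\n".join(block) for block in blocks) + "\n"
```

The block timestamp is taken from the first segment of each speaker turn, which matches the sample output shown earlier.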

How It Works

4-Phase Processing

Phase 1 - Archive Audio: Moves existing WAV files to assets/audio/_old/
Phase 2 - Extract Video: Extracts audio from video files using ffmpeg (mono, 16kHz)
Phase 3 - Transcribe: Processes all WAV files with OpenAI's diarization API
Phase 4 - Archive Video: Moves successfully transcribed videos to assets/video/_old/

Automatic File Splitting & Combining

Files larger than 25MB are automatically split into chunks using ffmpeg
Each chunk is transcribed with speaker diarization
All chunks are combined into a single output file with continuous timestamps
Timestamps are automatically adjusted across file boundaries
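
One way to get continuous timestamps is to shift each chunk's segments by the total duration of the chunks before it. The sketch below shows that idea; combine_chunks is a hypothetical helper, and the (duration, segments) input shape is an assumption made for the example, not necessarily how main.py represents chunks.

```python
def combine_chunks(chunks):
    """Merge per-chunk segments into one list with continuous timestamps.

    chunks: list of (duration_seconds, segments) in playback order, where
    each segment is a dict whose 'start' and 'end' are relative to its chunk.
    """
    combined = []
    offset = 0.0
    for duration, segments in chunks:
        for seg in segments:
            shifted = dict(seg)  # copy so the input is left untouched
            shifted["start"] = seg["start"] + offset
            shifted["end"] = seg["end"] + offset
            combined.append(shifted)
        # The next chunk begins where this one ended.
        offset += duration
    return combined
```

For example, a segment starting 2 seconds into the second of two 30-second chunks ends up at 32 seconds in the combined transcript.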

Speaker Identification

Uses OpenAI's gpt-4o-transcribe-diarize model
Automatically detects speaker changes
Labels speakers as A, B, C, etc. (generic labels)
Maintains speaker consistency across split file chunks
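
A request to the diarization model might look like the sketch below. It is not taken from main.py: transcribe_with_diarization is a hypothetical name, and the response_format="diarized_json" and chunking_strategy="auto" parameters, as well as the speaker/start/end/text fields on each segment, reflect OpenAI's published speech-to-text documentation as best understood here and should be checked against the current API reference.

```python
def transcribe_with_diarization(path: str):
    """Send one audio file to the diarization model and return its segments."""
    # Imported lazily so this module still loads without the openai package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="gpt-4o-transcribe-diarize",
            file=audio_file,
            # Assumed parameters; verify against the current API reference.
            response_format="diarized_json",
            chunking_strategy="auto",
        )
    # Flatten the response into plain dicts (field names are assumptions).
    return [
        {"speaker": seg.speaker, "start": seg.start, "end": seg.end, "text": seg.text}
        for seg in response.segments
    ]
```

Calling this function makes a billable API request, so it is worth trying it on a short clip first.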

Important Notes

File Processing Behavior

Videos are only archived after successful transcription. If transcription fails, the video remains in place for retry on the next run.
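
The archive-on-success behavior can be sketched as follows. This is an illustrative outline, not the tool's implementation: transcribe_then_archive is a hypothetical name, and the transcribe argument stands in for whatever function performs the actual transcription.

```python
import shutil
from pathlib import Path


def transcribe_then_archive(video: Path, transcribe) -> bool:
    """Run transcribe(video); move the video to _old/ only if it succeeds."""
    try:
        transcribe(video)
    except Exception as exc:
        # Leave the video where it is so the next run retries it.
        print(f"Transcription failed for {video.name}: {exc}")
        return False
    archive_dir = video.parent / "_old"
    archive_dir.mkdir(exist_ok=True)
    shutil.move(str(video), str(archive_dir / video.name))
    return True
```

Because the move happens only after transcribe returns without raising, a failed run never hides the source video from the next attempt.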

API Costs

This tool uses OpenAI's gpt-4o-transcribe-diarize model, which may have different pricing than the standard Whisper model. Check OpenAI's pricing page for current rates.

Speaker Labeling

Speakers are automatically labeled with generic identifiers (A, B, C, etc.). The model does not identify speakers by name unless you provide reference audio samples (not currently implemented in this tool).

Manual Audio Extraction (Optional)

If you prefer to extract audio manually or need custom settings:

ffmpeg -i meeting.webm -ac 1 -ar 16000 -vn meeting.wav

-ac 1 sets the number of audio channels to 1 (mono)
-ar 16000 sets the audio sampling rate to 16 kHz
-vn disables video, so the output contains only the audio stream

Place the resulting WAV file directly in the ASSET_INPUT_DIR directory.
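
If you want to script the same extraction for several files, the command above can be wrapped in Python. This is a minimal sketch under the assumption that ffmpeg is on your PATH; build_extract_command and extract_audio are hypothetical helper names, not functions from main.py.

```python
import subprocess
from pathlib import Path


def build_extract_command(video: Path, wav: Path) -> list[str]:
    """Return the ffmpeg argument list for mono, 16 kHz audio extraction."""
    return ["ffmpeg", "-i", str(video), "-ac", "1", "-ar", "16000", "-vn", str(wav)]


def extract_audio(video: Path, out_dir: Path) -> Path:
    """Extract the audio of one video into out_dir and return the WAV path."""
    wav = out_dir / (video.stem + ".wav")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_extract_command(video, wav), check=True)
    return wav
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.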

Manual Audio Splitting (Optional)

If needed, you can manually split audio files:

ffmpeg -i meeting.wav -f segment -segment_time 30 -c copy output%03d.wav

This creates multiple 30-second files (output000.wav, output001.wav, etc.) from your original audio.
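
To choose a -segment_time that keeps each chunk under the 25MB limit, you can estimate how many seconds of audio fit in a file. The sketch below assumes uncompressed 16-bit PCM WAV (ffmpeg's default for .wav output), the mono 16 kHz settings used above, and a 44-byte header; max_chunk_seconds is a hypothetical helper, and whether the limit is counted in binary or decimal megabytes is also an assumption.

```python
def max_chunk_seconds(limit_bytes: int = 25 * 1024 * 1024,
                      sample_rate: int = 16_000,
                      channels: int = 1,
                      bytes_per_sample: int = 2,
                      header_bytes: int = 44) -> int:
    """Largest whole number of seconds of PCM audio that fits in limit_bytes."""
    # 16,000 samples/s * 1 channel * 2 bytes = 32,000 bytes per second.
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return (limit_bytes - header_bytes) // bytes_per_second


print(max_chunk_seconds())            # 819 (limit read as 25 * 1024 * 1024 bytes)
print(max_chunk_seconds(25_000_000))  # 781 (limit read as 25,000,000 bytes)
```

Picking a value comfortably below these figures leaves room for extra metadata in the WAV header.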
