cpascariello/SpeechToText

Audio Transcription Tool with Speaker Diarization

This tool uses OpenAI's GPT-4o transcription API (the gpt-4o-transcribe-diarize model) to convert audio and video files to text, with automatic speaker identification and labeling.

Features

  • Automatic Video Processing: Extracts audio from video files automatically
  • Speaker Diarization: Identifies and labels different speakers (Speaker A, B, C, etc.)
  • Smart Formatting: Groups consecutive segments by speaker with timestamps
  • Combined Output: Merges split audio files into a single transcript with continuous timestamps
  • Automatic Splitting: Handles files >25MB by automatically splitting and recombining
  • Multi-speaker Support: Works with conversations, meetings, interviews, etc.

Setup

1. Install Prerequisites

System Requirements:

  • Python 3.11+
  • ffmpeg (for audio extraction and splitting)
  • OpenAI API key with access to GPT-4o models

Install ffmpeg:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

2. Set up Python environment

# Create virtual environment
python3 -m venv .

# Activate virtual environment
source bin/activate

# Install Python dependencies
pip install -r requirements.txt

3. Set up environment variables in .env

ASSET_INPUT_DIR=/path/to/input/audio
ASSET_OUTPUT_DIR=/path/to/output/
OPENAI_API_KEY=your-api-key-here

# Optional: Custom video input directory (defaults to assets/video/)
VIDEO_INPUT_DIR=/path/to/video/input

# Optional: Test mode - stop after extraction without transcribing
TEST_MODE=false
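
The variables above could be read in Python roughly as follows. This is a minimal sketch, not the code in main.py: the Settings class and load_settings helper are hypothetical names, and the rule that only the string "true" enables test mode is an assumption. It reads from a plain mapping so it works whether the values come from the shell or from a .env loader.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values documented in .env."""
    input_dir: Path
    output_dir: Path
    video_dir: Path
    api_key: str
    test_mode: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        # Required keys: a missing one raises KeyError immediately.
        input_dir=Path(env["ASSET_INPUT_DIR"]),
        output_dir=Path(env["ASSET_OUTPUT_DIR"]),
        # Optional key with the documented default.
        video_dir=Path(env.get("VIDEO_INPUT_DIR", "assets/video/")),
        api_key=env["OPENAI_API_KEY"],
        # Assumption: only "true" (any letter case) turns test mode on.
        test_mode=env.get("TEST_MODE", "false").strip().lower() == "true",
    )
```

Failing fast on a missing required key keeps a misconfigured run from reaching the API call stage.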

4. Run the transcription

python3 main.py

# Or run in test mode (extract audio only, no API calls)
TEST_MODE=true python3 main.py

Usage Workflow

  1. Place video files in the assets/video/ directory (.mp4, .webm, .avi, .mov)
  2. Run the script - it will automatically:
    • Archive existing WAV files to assets/audio/_old/
    • Extract audio from videos to assets/audio/ (mono, 16kHz)
    • Transcribe all audio files with speaker diarization
    • Move successfully transcribed videos to assets/video/_old/
  3. Find transcripts in the output directory as OUTPUT-{filename}.txt

Output Format

Transcripts are formatted with speaker labels and timestamps for readability:

[00:00:15] Speaker A:
Hello everyone, thanks for joining today's meeting.
I wanted to discuss our progress on the new features.
We've made some great improvements this week.

[00:00:28] Speaker B:
That's great to hear.
Can you tell us more about the specific changes?

[00:00:35] Speaker A:
Absolutely, let me walk through them.

Format Details:

  • Each speaker block starts with [HH:MM:SS] Speaker X:
  • Consecutive segments from the same speaker are grouped together
  • Text lines within a speaker block have no prefix
  • Blank lines separate different speakers
  • Output files are named OUTPUT-{filename}.txt
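
The grouping rules above can be illustrated with a short sketch. It assumes each segment is a dict with "speaker", "start" (seconds), and "text" keys; that shape and the function names are illustrative assumptions, not necessarily what the tool uses internally.

```python
from itertools import groupby


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as [HH:MM:SS], truncating fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def format_transcript(segments) -> str:
    """Group consecutive segments by speaker and render them as text blocks."""
    blocks = []
    # groupby only merges adjacent items, which matches "consecutive segments".
    for speaker, group in groupby(segments, key=lambda seg: seg["speaker"]):
        group = list(group)
        # The block header carries the start time of the first segment.
        header = f"{format_timestamp(group[0]['start'])} Speaker {speaker}:"
        lines = [seg["text"].strip() for seg in group]
        blocks.append("\n".join([header, *lines]))
    # A blank line separates speaker blocks.
    return "\n\n".join(blocks) + "\n"
```

Because the grouping is by adjacency, a speaker who talks again after someone else gets a new block with its own timestamp.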

How It Works

4-Phase Processing

  1. Phase 1 - Archive Audio: Moves existing WAV files to assets/audio/_old/
  2. Phase 2 - Extract Video: Extracts audio from video files using ffmpeg (mono, 16kHz)
  3. Phase 3 - Transcribe: Processes all WAV files with OpenAI's diarization API
  4. Phase 4 - Archive Video: Moves successfully transcribed videos to assets/video/_old/
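
As an illustration of Phase 2, the sketch below builds the ffmpeg command for one video and runs it for every supported file in a directory. The function names are hypothetical and the flags are the ones this README documents for manual extraction; the actual script may invoke ffmpeg differently.

```python
import subprocess
from pathlib import Path

# Extensions listed in the usage workflow.
VIDEO_EXTENSIONS = {".mp4", ".webm", ".avi", ".mov"}


def build_extract_command(video: Path, audio_dir: Path) -> list[str]:
    """Return the ffmpeg argument list for a mono, 16 kHz WAV extraction."""
    wav = audio_dir / (video.stem + ".wav")
    return ["ffmpeg", "-i", str(video), "-ac", "1", "-ar", "16000", "-vn", str(wav)]


def extract_all(video_dir: Path, audio_dir: Path) -> None:
    """Extract audio from every supported video in video_dir."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    for video in sorted(video_dir.iterdir()):
        if video.suffix.lower() in VIDEO_EXTENSIONS:
            # check=True raises CalledProcessError if ffmpeg exits non-zero.
            subprocess.run(build_extract_command(video, audio_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.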

Automatic File Splitting & Combining

  • Files larger than 25MB are automatically split into chunks using ffmpeg
  • Each chunk is transcribed with speaker diarization
  • All chunks are combined into a single output file with continuous timestamps
  • Timestamps are automatically adjusted across file boundaries
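
One way to keep timestamps continuous across chunks is to shift each chunk's segments by the total duration of the chunks before it. The sketch below assumes each chunk is a (duration_seconds, segments) pair and that segments carry "start" and "end" in seconds; these shapes and the combine_chunks name are assumptions for illustration. It only shifts timestamps and does not reconcile speaker labels between chunks.

```python
def combine_chunks(chunks):
    """Merge per-chunk segments into one list with continuous timestamps.

    chunks: list of (duration_seconds, segments) pairs in playback order.
    """
    combined = []
    offset = 0.0
    for duration, segments in chunks:
        for seg in segments:
            # Copy the segment so the caller's data is left untouched.
            combined.append(
                {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
            )
        # The next chunk starts where this one ended.
        offset += duration
    return combined
```

For example, a segment starting 2 seconds into the second of two 30-second chunks ends up at 32 seconds in the combined transcript.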

Speaker Identification

  • Uses OpenAI's gpt-4o-transcribe-diarize model
  • Automatically detects speaker changes
  • Labels speakers as A, B, C, etc. (generic labels)
  • Maintains speaker consistency across split file chunks

Important Notes

File Processing Behavior

Videos are only archived after successful transcription. If transcription fails, the video remains in place for retry on the next run.
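
The archive-on-success behavior can be sketched as below. The archive_if_transcribed helper and its transcribe callback are hypothetical; the only detail taken from this README is that archived videos go into an _old/ subdirectory and failed ones stay where they are.

```python
import shutil
from pathlib import Path


def archive_if_transcribed(video: Path, transcribe) -> bool:
    """Run transcribe(video); move the video to _old/ only if it succeeds."""
    try:
        transcribe(video)
    except Exception as exc:
        # Leave the file in place so the next run retries it.
        print(f"Transcription failed for {video.name}: {exc}")
        return False
    archive_dir = video.parent / "_old"
    archive_dir.mkdir(exist_ok=True)
    shutil.move(str(video), str(archive_dir / video.name))
    return True
```

Moving the file only after the callback returns means an interrupted or failed run never hides an unprocessed video.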

API Costs

This tool uses OpenAI's gpt-4o-transcribe-diarize model, which may have different pricing than the standard Whisper model. Check OpenAI's pricing page for current rates.

Speaker Labeling

Speakers are automatically labeled with generic identifiers (A, B, C, etc.). The model does not identify speakers by name unless you provide reference audio samples (not currently implemented in this tool).

Manual Audio Extraction (Optional)

If you prefer to extract audio manually or need custom settings:

ffmpeg -i meeting.webm -ac 1 -ar 16000 -vn meeting.wav
  • -ac 1 downmixes the audio to a single channel (mono)
  • -ar 16000 resamples the audio to 16 kHz
  • -vn drops the video stream from the output

Place the resulting WAV file directly in the ASSET_INPUT_DIR directory.

Manual Audio Splitting (Optional)

If needed, you can manually split audio files:

ffmpeg -i meeting.wav -f segment -segment_time 30 -c copy output%03d.wav

This creates multiple 30-second files (output000.wav, output001.wav, etc.) from your original audio.
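
To pick a -segment_time that keeps each chunk under the 25MB limit, the arithmetic can be sketched as follows. It assumes uncompressed 16-bit PCM (ffmpeg's default for .wav output) at the mono, 16kHz settings used above, treats 25MB as 25 * 1024 * 1024 bytes, and applies a 10% safety margin; the function name and the margin are illustrative choices, not values taken from the tool.

```python
def max_segment_seconds(
    limit_bytes: int = 25 * 1024 * 1024,
    sample_rate: int = 16000,
    channels: int = 1,
    bytes_per_sample: int = 2,
    safety: float = 0.9,
) -> int:
    """Longest whole-second segment whose PCM data stays under the limit."""
    # 16000 samples/s * 1 channel * 2 bytes = 32000 bytes per second.
    bytes_per_second = sample_rate * channels * bytes_per_sample
    # The safety factor leaves headroom for the WAV header and rounding.
    return int(limit_bytes * safety // bytes_per_second)
```

With the defaults this gives 737 seconds, a little over 12 minutes per chunk, which could be passed to ffmpeg as -segment_time 737.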
