tts_data_ch_preprocessing

End-to-end preprocessing pipeline for Swiss-German TTS training data.

Starts from raw audio collections and produces a manifest of clean, diarized, quality-filtered audio segments enriched with:

  • German (DE) Whisper transcription
  • Swiss-German (CH) transcription via Gemini/OpenRouter
  • Speech emotion recognition (SER) tags
  • Audio pattern recognition (APR/SED) tags — laughter, breathing, cough, etc.

Setup

# Clone the repository
git clone https://github.com/i4Ds/tts_data_ch_preprocessing.git
cd tts_data_ch_preprocessing

pip install -e .

Pipeline

raw audio
    │
    ▼
01_diarize.py                  pyannote diarization only
    │ JSON per file (speaker-labelled speech only, no transcription)
    ▼
01b_select_speakers.py          Select the dominant speaker per source folder
    │ speaker_selection.json
    ▼
02_extract_segments.py          Filter selected speaker, merge, cut → manifest.jsonl
    │ selected-speaker manifest.jsonl
    ├──────────────┬──────────────┬──────────────┐
    ▼              ▼              ▼              ▼
03a_transcribe_de  03b_ser       03c_apr        03d_dialect
DE text            SER tags      audio/SED tags dialect labels
    │              │              │              │
    │              │              │              ▼
    │              │              │   04a_transcribe_ch_gemini.py
    │              │              │   CH text via Gemini/OpenRouter
    │              │              │              │
    └──────────────┴──────────────┴──────────────┘
                               │
                               ▼
05_merge_annotations.py        Merge DE/CH/SER/APR/dialect into manifest
                               │
                               ▼
06_create_final_manifest.py    Final compact manifest with all annotations
                               │
                               ▼
07_push_hf_dataset.py          Optional Hugging Face Dataset export/push

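Steps 03a, 03b, 03c, and 03d all run after step 02 and do not depend on each other, so transcription and annotation costs are spent only on the selected-speaker clips. Step 04a is the exception: it consumes the dialect output of step 03d.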

The SLURM test pipeline reads shared paths and output names from slurm/test_config.sh. Set variables such as AUDIO_DIR, TEST_ROOT, MANIFEST, MANIFEST_DIALECT, and MANIFEST_SENTENCE_CH there, or pass the same paths explicitly with the CLI flags shown below.

Step-by-step

Step 1 — Diarize only

python 01_diarize.py /path/to/audio \
    --output-dir /path/to/jsons \
    --device cuda \
    --require-cuda \
    --overwrite

Step 1b — Select speaker

python 01b_select_speakers.py /path/to/jsons \
    --output /path/to/jsons/speaker_selection.json \
    --speakers-per-folder 1
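
The selection idea can be sketched in a few lines. This is an illustration only, not the code in 01b_select_speakers.py: it assumes that "dominant" means the speaker with the most total speech time, and the record keys (folder, speaker, start, end) are hypothetical.

```python
from collections import defaultdict


def select_speakers(segments, speakers_per_folder=1):
    """Pick the speakers with the most total speech time in each folder.

    `segments` is an iterable of dicts with the hypothetical keys
    "folder", "speaker", "start", and "end" (times in seconds).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for seg in segments:
        totals[seg["folder"]][seg["speaker"]] += seg["end"] - seg["start"]

    selection = {}
    for folder, per_speaker in totals.items():
        # Longest total duration first; ties are broken by speaker label.
        ranked = sorted(per_speaker.items(), key=lambda kv: (-kv[1], kv[0]))
        selection[folder] = [spk for spk, _ in ranked[:speakers_per_folder]]
    return selection


example = [
    {"folder": "show_a", "speaker": "SPEAKER_00", "start": 0.0, "end": 12.0},
    {"folder": "show_a", "speaker": "SPEAKER_01", "start": 12.0, "end": 15.0},
    {"folder": "show_a", "speaker": "SPEAKER_00", "start": 15.0, "end": 20.0},
    {"folder": "show_b", "speaker": "SPEAKER_01", "start": 0.0, "end": 8.0},
]
print(select_speakers(example))
```

With the example data, SPEAKER_00 is chosen for show_a (17 s against 3 s) and SPEAKER_01 for show_b.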

Step 2 — Extract segments

python 02_extract_segments.py /path/to/jsons \
    --output-dir /path/to/segments \
    --speaker-selection /path/to/jsons/speaker_selection.json \
    --allow-empty-text \
    --min-purity 0.95 \
    --min-coverage 0.9 \
    --min-duration 3.0 \
    --max-duration 15.0 \
    --quality-prefilter \
    --quality-prefilter-apr

--quality-prefilter runs post-cut audio checks before any transcription credits are spent. It enables DNSMOS scoring and rejects low-background-quality segments. Add --quality-prefilter-apr when you also want AST/APR music tagging in step 2; this stores audio_tag_frames in manifest.jsonl and rejects music-heavy clips early.
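
The numeric thresholds in the step 2 command can be read as a simple gate on each candidate clip. The sketch below is an assumption about their meaning, not the logic of 02_extract_segments.py: it treats purity as the fraction of the clip attributed to the selected speaker and coverage as the fraction of the clip covered by speech, and the Candidate class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    duration: float  # clip length in seconds
    purity: float    # assumed: share of the clip belonging to the selected speaker
    coverage: float  # assumed: share of the clip that is speech


def passes_gates(c, min_purity=0.95, min_coverage=0.9,
                 min_duration=3.0, max_duration=15.0):
    """Return True if the clip satisfies every threshold (defaults mirror the command above)."""
    return (
        min_duration <= c.duration <= max_duration
        and c.purity >= min_purity
        and c.coverage >= min_coverage
    )


clips = [
    Candidate(duration=5.0, purity=0.97, coverage=0.95),   # passes all gates
    Candidate(duration=2.0, purity=0.99, coverage=0.99),   # too short
    Candidate(duration=5.0, purity=0.90, coverage=0.99),   # speaker overlap too high
]
print([passes_gates(c) for c in clips])
```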

Step 3a — German transcription (parallel)

python 03a_transcribe_de.py \
    --input  /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_transcript_de.jsonl \
    --language de \
    --device cuda \
    --require-cuda \
    --overwrite

After step 3a, you can filter the clips on avg_logprob and word confidence before running the remaining annotation jobs. This gives the cheapest possible downstream pass, because low-confidence clips never reach the later steps.
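
A confidence filter of this kind could look like the sketch below. It is not part of the repository: the row keys avg_logprob and word_probs are assumptions about the step 3a output, and the default thresholds are simply borrowed from the Gemini transcription command in this README.

```python
def keep_confident(rows, min_avg_logprob=-0.5, min_word_prob_mean=0.70):
    """Keep rows whose Whisper confidence clears both thresholds.

    Each row is a dict with the assumed keys "avg_logprob" (float) and
    "word_probs" (list of per-word probabilities).
    """
    kept = []
    for row in rows:
        avg_logprob = row.get("avg_logprob")
        word_probs = row.get("word_probs") or []
        if avg_logprob is None or not word_probs:
            continue  # no confidence information: treat as not confident
        word_prob_mean = sum(word_probs) / len(word_probs)
        if avg_logprob >= min_avg_logprob and word_prob_mean >= min_word_prob_mean:
            kept.append(row)
    return kept


rows = [
    {"id": "a", "avg_logprob": -0.2, "word_probs": [0.9, 0.8]},
    {"id": "b", "avg_logprob": -0.9, "word_probs": [0.9, 0.9]},
    {"id": "c", "avg_logprob": -0.1, "word_probs": [0.5, 0.6]},
    {"id": "d", "avg_logprob": -0.1, "word_probs": []},
]
print([r["id"] for r in keep_confident(rows)])
```

Only row "a" survives: "b" fails the log-probability threshold, "c" fails the mean word probability, and "d" has no word probabilities at all.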

Step 3b — Speech emotion recognition (parallel)

python 03b_ser.py \
    --input /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_ser.jsonl \
    --overwrite

Step 3c — Audio pattern recognition (parallel)

python 03c_apr.py \
    --input /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_apr.jsonl \
    --overwrite

Skip this step if --quality-prefilter-apr in step 2 already produced the APR frames you want.

Step 3d — Dialect identification (parallel)

python 03d_dialect.py \
    --input /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_dialect.jsonl

Step 4a — Swiss-German transcription via Gemini

python 04a_transcribe_ch_gemini.py \
    --input /path/to/segments/manifest_dialect.jsonl \
    --output /path/to/segments/manifest_sentence_ch.jsonl \
    --transcript-de-jsonl /path/to/segments/manifest_transcript_de.jsonl \
    --min-avg-logprob -0.5 \
    --min-word-prob-mean 0.70 \
    --http-referer "$OPENROUTER_HTTP_REFERER" \
    --service-tier flex

Passing --transcript-de-jsonl makes the script skip clips with weak DE transcription confidence, as judged by --min-avg-logprob and --min-word-prob-mean, before any OpenRouter request is made. By default it also skips rows that are missing from the DE transcript output; use --allow-missing-transcript-de only if you intentionally want those rows sent to Gemini anyway.
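
The skip behaviour described above amounts to a lookup against the DE transcript rows before each request. The following is a minimal sketch under stated assumptions, not the code in 04a_transcribe_ch_gemini.py: the join key id and the function name are hypothetical, and only the avg_logprob check is shown.

```python
def select_for_gemini(manifest_rows, de_rows, min_avg_logprob=-0.5,
                      allow_missing_transcript_de=False):
    """Return the manifest rows that would be sent on for CH transcription."""
    de_by_id = {row["id"]: row for row in de_rows}  # "id" is an assumed join key
    selected = []
    for row in manifest_rows:
        de = de_by_id.get(row["id"])
        if de is None:
            # No DE transcript row: skipped unless explicitly allowed.
            if allow_missing_transcript_de:
                selected.append(row)
            continue
        if de["avg_logprob"] >= min_avg_logprob:
            selected.append(row)
    return selected


manifest = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
transcripts_de = [
    {"id": "a", "avg_logprob": -0.2},  # confident
    {"id": "b", "avg_logprob": -1.3},  # weak, skipped
    # "c" has no DE transcript row
]
print([r["id"] for r in select_for_gemini(manifest, transcripts_de)])
print([r["id"] for r in select_for_gemini(manifest, transcripts_de,
                                          allow_missing_transcript_de=True)])
```

The first call keeps only "a"; the second also keeps "c", mirroring the opt-in nature of --allow-missing-transcript-de.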

Step 4b — Merge sentence_ch only

python 04b_merge_sentence_ch.py \
    /path/to/segments/manifest.jsonl \
    /path/to/segments/manifest_sentence_ch.jsonl \
    --output /path/to/segments/manifest_with_sentence_ch.jsonl

Step 5 — Merge annotations

python 05_merge_annotations.py \
    --input /path/to/segments/manifest.jsonl \
    --transcript-de /path/to/segments/manifest_transcript_de.jsonl \
    --sentence-ch /path/to/segments/manifest_sentence_ch.jsonl \
    --ser /path/to/segments/manifest_ser.jsonl \
    --apr /path/to/segments/manifest_apr.jsonl \
    --dialect /path/to/segments/manifest_dialect.jsonl \
    --output /path/to/segments/manifest_annotated.jsonl
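
Conceptually, the merge is a left join of each annotation file onto the base manifest. The sketch below illustrates that idea only; the join key id, the rule that base fields win on conflict, and the example field values are assumptions rather than the behaviour of 05_merge_annotations.py.

```python
def merge_annotations(base_rows, side_tables, key="id"):
    """Left-join several annotation tables onto the base manifest rows.

    Every base row is kept. Fields from a side table are added only if the
    row does not already have them, so base values are never overwritten.
    """
    indexes = [{row[key]: row for row in rows} for rows in side_tables]
    merged = []
    for row in base_rows:
        out = dict(row)
        for index in indexes:
            for field, value in index.get(row[key], {}).items():
                out.setdefault(field, value)
        merged.append(out)
    return merged


base = [{"id": "a", "duration": 4.2}, {"id": "b", "duration": 7.0}]
transcript_de = [{"id": "a", "text_de": "Guten Tag"}]
ser = [{"id": "a", "emotion": "neutral"}, {"id": "b", "emotion": "happy"}]

for row in merge_annotations(base, [transcript_de, ser]):
    print(row)
```

Row "b" has no DE transcript in this example, so it is kept with only its SER annotation.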

Step 6 — Final manifest

python 06_create_final_manifest.py \
    --input /path/to/segments/manifest_annotated.jsonl \
    --output /path/to/segments/manifest_final.jsonl

06_create_final_manifest.py warns, but still writes rows, when optional emotion_frames or audio_tag_frames are missing. Missing emotion becomes UNKNOWN; missing audio tags become an empty tags list.
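
The fallback rule can be sketched as follows. This is an illustration, not the script itself: the output keys emotion and tags, the per-frame label key, and the majority vote over frames are assumptions; only emotion_frames, audio_tag_frames, and the UNKNOWN and empty-list defaults come from the description above.

```python
import warnings
from collections import Counter


def finalize_row(row):
    """Build a compact row, warning (not failing) on missing optional fields."""
    out = {"id": row.get("id")}

    emotion_frames = row.get("emotion_frames")
    if emotion_frames:
        # Assumed aggregation: the most frequent frame label wins.
        labels = [frame["label"] for frame in emotion_frames]
        out["emotion"] = Counter(labels).most_common(1)[0][0]
    else:
        warnings.warn(f"row {out['id']}: emotion_frames missing, using UNKNOWN")
        out["emotion"] = "UNKNOWN"

    tag_frames = row.get("audio_tag_frames")
    if tag_frames:
        # Unique tags in sorted order.
        out["tags"] = sorted({frame["label"] for frame in tag_frames})
    else:
        warnings.warn(f"row {out['id']}: audio_tag_frames missing, using empty tags")
        out["tags"] = []
    return out


complete = {
    "id": "a",
    "emotion_frames": [{"label": "happy"}, {"label": "neutral"}, {"label": "happy"}],
    "audio_tag_frames": [{"label": "laughter"}, {"label": "breathing"}],
}
print(finalize_row(complete))
print(finalize_row({"id": "b"}))
```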
