tts_data_ch_preprocessing

End-to-end preprocessing pipeline for Swiss-German TTS training data.

Starts from raw audio collections and produces a manifest of clean, diarized, quality-filtered audio segments enriched with:

German (DE) Whisper transcription
Swiss-German (CH) transcription via Gemini/OpenRouter
Speech emotion recognition (SER) tags
Audio pattern recognition (APR/SED) tags — laughter, breathing, cough, etc.

Setup

# Clone the repository
git clone https://github.com/i4Ds/tts_data_ch_preprocessing.git
cd tts_data_ch_preprocessing

pip install -e .

Pipeline

raw audio
    │
    ▼
01_diarize.py                  pyannote diarization only
    │ JSON per file (speaker-labelled speech only, no transcription)
    ▼
01b_select_speakers.py          Select the dominant speaker per source folder
    │ speaker_selection.json
    ▼
02_extract_segments.py          Filter selected speaker, merge, cut → manifest.jsonl
    │ selected-speaker manifest.jsonl
    ├──────────────┬──────────────┬──────────────┐
    ▼              ▼              ▼              ▼
03a_transcribe_de  03b_ser       03c_apr        03d_dialect
DE text            SER tags      audio/SED tags dialect labels
    │              │              │              │
    │              │              │              ▼
    │              │              │   04a_transcribe_ch_gemini.py
    │              │              │   CH text via Gemini/OpenRouter
    │              │              │              │
    └──────────────┴──────────────┴──────────────┘
                               │
                               ▼
05_merge_annotations.py        Merge DE/CH/SER/APR/dialect into manifest
                               │
                               ▼
06_create_final_manifest.py    Final compact manifest with all annotations
                               │
                               ▼
07_push_hf_dataset.py          Optional Hugging Face Dataset export/push

Steps 03a / 03b / 03c / 03d run after step 02, so transcription and annotation costs are spent only on the selected speaker clips.
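Because the four annotation jobs all read the same manifest.jsonl, they can be launched side by side. The following is a minimal sketch of that fan-out, not part of the repository: the helper names `build_annotation_commands` and `run_in_parallel` are hypothetical, and only flags that appear in the step-by-step commands are used. On a cluster, separate SLURM jobs are likely the better fit; this only illustrates the independence of the four steps.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_annotation_commands(segments_dir):
    """Build one command per annotation job (hypothetical helper)."""
    seg = Path(segments_dir)
    manifest = str(seg / "manifest.jsonl")

    def out(name):
        return str(seg / name)

    return {
        "03a_transcribe_de": ["python", "03a_transcribe_de.py", "--input", manifest,
                              "--output", out("manifest_transcript_de.jsonl"),
                              "--language", "de", "--device", "cuda"],
        "03b_ser": ["python", "03b_ser.py", "--input", manifest,
                    "--output", out("manifest_ser.jsonl")],
        "03c_apr": ["python", "03c_apr.py", "--input", manifest,
                    "--output", out("manifest_apr.jsonl")],
        "03d_dialect": ["python", "03d_dialect.py", "--input", manifest,
                        "--output", out("manifest_dialect.jsonl")],
    }


def run_in_parallel(commands, runner=subprocess.run):
    """Run every command concurrently; raises if any job exits non-zero."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = {name: pool.submit(runner, cmd, check=True)
                   for name, cmd in commands.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Only prints the commands; call run_in_parallel(...) to execute them.
    for name, cmd in build_annotation_commands("/path/to/segments").items():
        print(name, " ".join(cmd))
```

Running several GPU jobs on one device may cause memory contention, so in practice you might cap `max_workers` or place each job on its own GPU.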

The SLURM test pipeline reads shared paths and output names from slurm/test_config.sh. Set variables such as AUDIO_DIR, TEST_ROOT, MANIFEST, MANIFEST_DIALECT, and MANIFEST_SENTENCE_CH there, or pass the same paths explicitly with the CLI flags shown below.

Step-by-step

Step 1 — Diarize only

python 01_diarize.py /path/to/audio \
    --output-dir /path/to/jsons \
    --device cuda \
    --require-cuda \
    --overwrite

Step 1b — Select speaker

python 01b_select_speakers.py /path/to/jsons \
    --output /path/to/jsons/speaker_selection.json \
    --speakers-per-folder 1
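
Selecting the dominant speaker can be pictured as summing speech time per speaker across a folder's diarization output and keeping the top entries. A minimal sketch, assuming each diarization segment is a dict with the hypothetical keys `speaker`, `start`, and `end`; the actual JSON schema written by 01_diarize.py and the ranking used by 01b_select_speakers.py may differ.

```python
from collections import defaultdict


def select_dominant_speakers(segments, speakers_per_folder=1):
    """Return the speakers with the most total speech time, longest first.

    `segments` is an iterable of dicts with the assumed keys
    "speaker", "start", and "end" (times in seconds).
    """
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += max(0.0, seg["end"] - seg["start"])
    # Sort by total duration (descending), then by label for a stable tie-break.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [speaker for speaker, _ in ranked[:speakers_per_folder]]


example = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 12.5},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 15.0},
    {"speaker": "SPEAKER_00", "start": 15.0, "end": 30.0},
]
print(select_dominant_speakers(example))
```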

Step 2 — Extract segments

python 02_extract_segments.py /path/to/jsons \
    --output-dir /path/to/segments \
    --speaker-selection /path/to/jsons/speaker_selection.json \
    --allow-empty-text \
    --min-purity 0.95 \
    --min-coverage 0.9 \
    --min-duration 3.0 \
    --max-duration 15.0 \
    --quality-prefilter \
    --quality-prefilter-apr

--quality-prefilter runs post-cut audio checks before any transcription credits are spent: it enables DNSMOS scoring and rejects segments whose background quality is low. Add --quality-prefilter-apr to also run AST/APR music tagging in step 2; this stores audio_tag_frames in manifest.jsonl and rejects music-heavy clips early.
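
The numeric thresholds in the Step 2 command can be read as a simple per-segment predicate. The sketch below is illustrative only: the keys `purity`, `coverage`, and `duration` are assumed names, and how 02_extract_segments.py actually computes purity and coverage is not shown here.

```python
def passes_segment_filters(segment, min_purity=0.95, min_coverage=0.9,
                           min_duration=3.0, max_duration=15.0):
    """Check one candidate segment against the Step 2 thresholds.

    The dict keys are assumptions made for this sketch, not necessarily
    the field names used inside 02_extract_segments.py.
    """
    return (
        segment["purity"] >= min_purity
        and segment["coverage"] >= min_coverage
        and min_duration <= segment["duration"] <= max_duration
    )


candidates = [
    {"purity": 0.97, "coverage": 0.93, "duration": 6.4},   # kept
    {"purity": 0.97, "coverage": 0.93, "duration": 2.0},   # too short
    {"purity": 0.90, "coverage": 0.93, "duration": 6.4},   # speaker purity too low
]
kept = [seg for seg in candidates if passes_segment_filters(seg)]
print(len(kept))
```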

Step 3a — German transcription (parallel)

python 03a_transcribe_de.py \
    --input  /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_transcript_de.jsonl \
    --language de \
    --device cuda \
    --require-cuda \
    --overwrite

After step 3a, filter on avg_logprob and word confidence before running the remaining annotation jobs if you want the cheapest possible downstream pass.
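
Such a confidence filter might look like the sketch below. It assumes each row of manifest_transcript_de.jsonl carries `avg_logprob` plus a list of per-word probabilities under the hypothetical key `word_probs`; the thresholds mirror the --min-avg-logprob and --min-word-prob-mean values used in the Gemini step's example command.

```python
import json
from statistics import mean


def keep_confident_rows(lines, min_avg_logprob=-0.5, min_word_prob_mean=0.70):
    """Yield parsed JSONL rows whose DE transcription looks confident enough.

    "avg_logprob" is named in the README; "word_probs" is an assumed
    field name used only for this sketch.
    """
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        if row.get("avg_logprob", float("-inf")) < min_avg_logprob:
            continue
        word_probs = row.get("word_probs") or []
        if not word_probs or mean(word_probs) < min_word_prob_mean:
            continue
        yield row


lines = [
    '{"id": "a", "avg_logprob": -0.2, "word_probs": [0.9, 0.8]}',
    '{"id": "b", "avg_logprob": -0.9, "word_probs": [0.9, 0.9]}',
    '{"id": "c", "avg_logprob": -0.1, "word_probs": [0.5, 0.6]}',
    "",
]
print([row["id"] for row in keep_confident_rows(lines)])
```

With a real file you would pass an open file handle instead of the in-memory `lines` list, since the function only iterates over lines.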

Step 3b — Speech emotion recognition (parallel)

python 03b_ser.py \
    --input /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_ser.jsonl \
    --overwrite

Step 3c — Audio pattern recognition (parallel)

python 03c_apr.py \
    --input /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_apr.jsonl \
    --overwrite

Skip this step if --quality-prefilter-apr already produced the APR frames you want in step 2.

Step 3d — Dialect identification (parallel)

python 03d_dialect.py \
    --input /path/to/segments/manifest.jsonl \
    --output /path/to/segments/manifest_dialect.jsonl

Step 4a — Swiss-German transcription via Gemini

python 04a_transcribe_ch_gemini.py \
    --input /path/to/segments/manifest_dialect.jsonl \
    --output /path/to/segments/manifest_sentence_ch.jsonl \
    --transcript-de-jsonl /path/to/segments/manifest_transcript_de.jsonl \
    --min-avg-logprob -0.5 \
    --min-word-prob-mean 0.70 \
    --http-referer "$OPENROUTER_HTTP_REFERER" \
    --service-tier flex

Passing --transcript-de-jsonl makes the script skip clips with weak DE transcription confidence before any OpenRouter request is made. By default it also skips rows that are missing from the DE transcript output; use --allow-missing-transcript-de only if you intentionally want those rows sent to Gemini anyway.
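
The skip logic described above amounts to a lookup of each input row in the DE transcript output. A minimal sketch under stated assumptions: rows are matched on a hypothetical `segment_id` key, and `is_confident` stands in for whatever confidence check the script applies; neither is taken from 04a_transcribe_ch_gemini.py itself.

```python
def rows_to_send(dialect_rows, de_rows, is_confident,
                 allow_missing_transcript_de=False, key="segment_id"):
    """Return the rows that would still be sent for CH transcription.

    `key` and `is_confident` are assumptions of this sketch.
    """
    de_by_key = {row[key]: row for row in de_rows}
    selected = []
    for row in dialect_rows:
        de_row = de_by_key.get(row[key])
        if de_row is None:
            # No DE transcript for this clip: skipped unless explicitly allowed.
            if allow_missing_transcript_de:
                selected.append(row)
            continue
        if is_confident(de_row):
            selected.append(row)
    return selected


dialect_rows = [{"segment_id": "a"}, {"segment_id": "b"}, {"segment_id": "c"}]
de_rows = [
    {"segment_id": "a", "avg_logprob": -0.2},
    {"segment_id": "b", "avg_logprob": -1.3},
]


def confident(de_row):
    return de_row["avg_logprob"] >= -0.5


print([r["segment_id"] for r in rows_to_send(dialect_rows, de_rows, confident)])
```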

Step 4b — Merge sentence_ch only

python 04b_merge_sentence_ch.py \
    /path/to/segments/manifest.jsonl \
    /path/to/segments/manifest_sentence_ch.jsonl \
    --output /path/to/segments/manifest_with_sentence_ch.jsonl

Step 5 — Merge annotations

python 05_merge_annotations.py \
    --input /path/to/segments/manifest.jsonl \
    --transcript-de /path/to/segments/manifest_transcript_de.jsonl \
    --sentence-ch /path/to/segments/manifest_sentence_ch.jsonl \
    --ser /path/to/segments/manifest_ser.jsonl \
    --apr /path/to/segments/manifest_apr.jsonl \
    --dialect /path/to/segments/manifest_dialect.jsonl \
    --output /path/to/segments/manifest_annotated.jsonl
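
Conceptually, the merge is a left join of each annotation file onto the base manifest. The sketch below shows one way that could work; the join key `segment_id`, the example field `text_de`, and the rule that later annotation sets overwrite earlier fields are all assumptions, not the documented behaviour of 05_merge_annotations.py.

```python
def merge_annotations(base_rows, *annotation_sets, key="segment_id"):
    """Left-join annotation rows onto the base manifest rows.

    Rows are matched on an assumed `key`; annotations for segments that
    are not in the base manifest are ignored.
    """
    merged = {row[key]: dict(row) for row in base_rows}
    for rows in annotation_sets:
        for row in rows:
            target = merged.get(row[key])
            if target is None:
                continue  # annotation without a matching base segment
            target.update({k: v for k, v in row.items() if k != key})
    # dicts preserve insertion order, so output follows the base manifest.
    return list(merged.values())


base = [{"segment_id": "a", "duration": 4.2}, {"segment_id": "b", "duration": 7.0}]
de = [{"segment_id": "a", "text_de": "Guten Tag"}]
ser = [{"segment_id": "b", "emotion_frames": []},
       {"segment_id": "zzz", "emotion_frames": []}]
for merged_row in merge_annotations(base, de, ser):
    print(merged_row)
```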

Step 6 — Final manifest

python 06_create_final_manifest.py \
    --input /path/to/segments/manifest_annotated.jsonl \
    --output /path/to/segments/manifest_final.jsonl

06_create_final_manifest.py warns, but still writes rows, when optional emotion_frames or audio_tag_frames are missing. Missing emotion becomes UNKNOWN; missing audio tags become an empty tags list.
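
The fallback behaviour can be illustrated with a small sketch. The field names emotion_frames and audio_tag_frames come from the text above; everything else is assumed for illustration, including the per-frame `label` key, the majority vote over emotion frames, and the output keys `emotion` and `tags`.

```python
import warnings
from collections import Counter


def finalize_row(row):
    """Collapse frame-level annotations into compact fields with fallbacks."""
    final = {k: v for k, v in row.items()
             if k not in ("emotion_frames", "audio_tag_frames")}

    emotion_frames = row.get("emotion_frames")
    if emotion_frames:
        # Assumed reduction: the most frequent frame label wins.
        counts = Counter(frame["label"] for frame in emotion_frames)
        final["emotion"] = counts.most_common(1)[0][0]
    else:
        warnings.warn("emotion_frames missing; using UNKNOWN", stacklevel=2)
        final["emotion"] = "UNKNOWN"

    tag_frames = row.get("audio_tag_frames")
    if tag_frames:
        final["tags"] = sorted({frame["label"] for frame in tag_frames})
    else:
        warnings.warn("audio_tag_frames missing; using empty tags", stacklevel=2)
        final["tags"] = []
    return final


annotated = {
    "segment_id": "a",
    "emotion_frames": [{"label": "happy"}, {"label": "happy"}, {"label": "neutral"}],
    "audio_tag_frames": [{"label": "Laughter"}, {"label": "Breathing"},
                         {"label": "Laughter"}],
}
print(finalize_row(annotated))
```

A row without either frame field still produces an output row, which matches the "warns, but still writes rows" behaviour described above.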
