End-to-end preprocessing pipeline for Swiss-German TTS training data.
Starts from raw audio collections and produces a manifest of clean, diarized, quality-filtered audio segments enriched with:
- German (DE) Whisper transcription
- Swiss-German (CH) transcription via Gemini/OpenRouter
- Speech emotion recognition (SER) tags
- Audio pattern recognition (APR/SED) tags — laughter, breathing, cough, etc.
# Clone the repository
git clone https://github.com/i4Ds/tts_data_ch_preprocessing.git
cd tts_data_ch_preprocessing
pip install -e .

raw audio
    │
    ▼
01_diarize.py                pyannote diarization only
    │  JSON per file (speaker-labelled speech only, no transcription)
    ▼
01b_select_speakers.py       Select the dominant speaker per source folder
    │  speaker_selection.json
    ▼
02_extract_segments.py       Filter selected speaker, merge, cut → manifest.jsonl
    │  selected-speaker manifest.jsonl
    ├──────────────┬──────────────┬──────────────┐
    ▼              ▼              ▼              ▼
03a_transcribe_de  03b_ser        03c_apr        03d_dialect
DE text            SER tags       audio/SED tags dialect labels
    │              │              │              │
    │              │              │              ▼
    │              │              │ 04a_transcribe_ch_gemini.py
    │              │              │ CH text via Gemini/OpenRouter
    │              │              │              │
    └──────────────┴──────────────┴──────────────┘
    │
    ▼
05_merge_annotations.py      Merge DE/CH/SER/APR/dialect into manifest
    │
    ▼
06_create_final_manifest.py  Final compact manifest with all annotations
    │
    ▼
07_push_hf_dataset.py        Optional Hugging Face Dataset export/push

Steps 03a / 03b / 03c / 03d run after step 02, so transcription and annotation costs are spent only on the selected speaker clips.
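The ordering above can be sketched as a small dependency graph. This is only an illustration built from the stage names in the diagram, not code from the repository; the extra 03a dependency of 04a is an assumption based on the --transcript-de-jsonl flag, and the helper name execution_waves is hypothetical.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on (names taken from the diagram).
STAGE_DEPS = {
    "01_diarize": set(),
    "01b_select_speakers": {"01_diarize"},
    "02_extract_segments": {"01b_select_speakers"},
    "03a_transcribe_de": {"02_extract_segments"},
    "03b_ser": {"02_extract_segments"},
    "03c_apr": {"02_extract_segments"},
    "03d_dialect": {"02_extract_segments"},
    # Assumption: 04a also waits for 03a when --transcript-de-jsonl is used.
    "04a_transcribe_ch_gemini": {"03d_dialect", "03a_transcribe_de"},
    "05_merge_annotations": {
        "03a_transcribe_de",
        "03b_ser",
        "03c_apr",
        "03d_dialect",
        "04a_transcribe_ch_gemini",
    },
    "06_create_final_manifest": {"05_merge_annotations"},
    "07_push_hf_dataset": {"06_create_final_manifest"},
}


def execution_waves(deps):
    """Group stages into waves; stages inside one wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(STAGE_DEPS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The four 03 stages land in the same wave, which is why they can be submitted as independent jobs once step 02 has written manifest.jsonl.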
The SLURM test pipeline reads shared paths and output names from
slurm/test_config.sh. Set variables such as AUDIO_DIR, TEST_ROOT,
MANIFEST, MANIFEST_DIALECT, and MANIFEST_SENTENCE_CH there, or pass the
same paths explicitly with the CLI flags shown below.
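The "config variable or explicit flag" pattern can be illustrated with a short wrapper. This is a hypothetical sketch: the --audio-dir and --manifest flags and the resolve_paths helper are invented for illustration and are not flags of the pipeline scripts. Only the variable names AUDIO_DIR and MANIFEST come from slurm/test_config.sh as described above.

```python
import argparse
import os


def build_parser(env):
    """Hypothetical parser: each flag falls back to an environment variable."""
    parser = argparse.ArgumentParser(description="Illustrative path resolver")
    parser.add_argument("--audio-dir", default=env.get("AUDIO_DIR"))
    parser.add_argument("--manifest", default=env.get("MANIFEST"))
    return parser


def resolve_paths(argv, env=None):
    """Return parsed paths; an explicit flag wins over the environment."""
    env = os.environ if env is None else env
    args = build_parser(env).parse_args(argv)
    missing = [name for name, value in vars(args).items() if not value]
    if missing:
        raise SystemExit(f"missing paths: {', '.join(missing)}")
    return args


if __name__ == "__main__":
    demo_env = {"AUDIO_DIR": "/path/to/audio", "MANIFEST": "/path/to/manifest.jsonl"}
    print(resolve_paths(["--manifest", "/tmp/other.jsonl"], env=demo_env))
```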
python 01_diarize.py /path/to/audio \
--output-dir /path/to/jsons \
--device cuda \
--require-cuda \
--overwrite

python 01b_select_speakers.py /path/to/jsons \
--output /path/to/jsons/speaker_selection.json \
--speakers-per-folder 1

python 02_extract_segments.py /path/to/jsons \
--output-dir /path/to/segments \
--speaker-selection /path/to/jsons/speaker_selection.json \
--allow-empty-text \
--min-purity 0.95 \
--min-coverage 0.9 \
--min-duration 3.0 \
--max-duration 15.0 \
--quality-prefilter \
--quality-prefilter-apr

--quality-prefilter runs post-cut audio checks before any transcription credits are
spent. It enables DNSMOS scoring and rejects low-background-quality segments. Add
--quality-prefilter-apr when you also want AST/APR music tagging in step 2; this
stores audio_tag_frames in manifest.jsonl and rejects music-heavy clips early.
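The two rejection rules described above could look roughly like the sketch below. It is not the repository's implementation: the dnsmos_bak field, the per-frame "label" key, the "Music" label string, and both thresholds are assumptions chosen for illustration. Only audio_tag_frames is a name taken from the text.

```python
def music_fraction(audio_tag_frames):
    """Share of frames labelled as music (hypothetical frame layout)."""
    if not audio_tag_frames:
        return 0.0
    music = sum(1 for frame in audio_tag_frames if frame.get("label") == "Music")
    return music / len(audio_tag_frames)


def prefilter_reason(row, min_bak=3.0, max_music_fraction=0.2):
    """Return a rejection reason, or None when the segment passes."""
    # Assumption: the DNSMOS background score is stored as row["dnsmos_bak"].
    bak = row.get("dnsmos_bak")
    if bak is not None and bak < min_bak:
        return "low_background_quality"
    if music_fraction(row.get("audio_tag_frames")) > max_music_fraction:
        return "music_heavy"
    return None


if __name__ == "__main__":
    clip = {
        "dnsmos_bak": 3.8,
        "audio_tag_frames": [{"label": "Music"}, {"label": "Speech"}],
    }
    print(prefilter_reason(clip))
```

Rejecting at this point means a rejected clip never reaches the Whisper or Gemini steps.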
python 03a_transcribe_de.py \
--input /path/to/segments/manifest.jsonl \
--output /path/to/segments/manifest_transcript_de.jsonl \
--language de \
--device cuda \
--require-cuda \
--overwrite

After step 03a, filter the clips on avg_logprob and word confidence before running
the remaining annotation jobs if you want the cheapest possible downstream pass.
python 03b_ser.py \
--input /path/to/segments/manifest.jsonl \
--output /path/to/segments/manifest_ser.jsonl \
--overwrite

python 03c_apr.py \
--input /path/to/segments/manifest.jsonl \
--output /path/to/segments/manifest_apr.jsonl \
--overwrite

Skip step 03c if --quality-prefilter-apr already produced the APR frames you want in
step 2.
python 03d_dialect.py \
--input /path/to/segments/manifest.jsonl \
--output /path/to/segments/manifest_dialect.jsonl

python 04a_transcribe_ch_gemini.py \
--input /path/to/segments/manifest_dialect.jsonl \
--output /path/to/segments/manifest_sentence_ch.jsonl \
--transcript-de-jsonl /path/to/segments/manifest_transcript_de.jsonl \
--min-avg-logprob -0.5 \
--min-word-prob-mean 0.70 \
--http-referer "$OPENROUTER_HTTP_REFERER" \
--service-tier flex

Passing --transcript-de-jsonl makes the script skip clips with weak DE transcription
confidence before any OpenRouter request is made. By default it also skips rows
missing from the DE transcript output; use --allow-missing-transcript-de only if
you intentionally want those sent to Gemini anyway.
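The gating behaviour can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's code: the "id" join key, the word_probs field, and the select_for_gemini helper are hypothetical, while the two thresholds mirror the --min-avg-logprob and --min-word-prob-mean values in the command above.

```python
from statistics import fmean


def select_for_gemini(
    manifest_rows,
    de_rows,
    min_avg_logprob=-0.5,
    min_word_prob_mean=0.70,
    allow_missing_transcript_de=False,
):
    """Split clip ids into those to send and those skipped, with a reason."""
    # Assumption: both files share an "id" field that identifies a clip.
    de_by_id = {row["id"]: row for row in de_rows}
    send, skipped = [], []
    for row in manifest_rows:
        de = de_by_id.get(row["id"])
        if de is None:
            if allow_missing_transcript_de:
                send.append(row["id"])
            else:
                skipped.append((row["id"], "missing_transcript_de"))
            continue
        # Assumption: per-word probabilities live in "word_probs".
        word_probs = de.get("word_probs") or [0.0]
        weak = (
            de["avg_logprob"] < min_avg_logprob
            or fmean(word_probs) < min_word_prob_mean
        )
        if weak:
            skipped.append((row["id"], "low_de_confidence"))
        else:
            send.append(row["id"])
    return send, skipped


if __name__ == "__main__":
    manifest = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    de = [
        {"id": "a", "avg_logprob": -0.2, "word_probs": [0.9, 0.8]},
        {"id": "b", "avg_logprob": -0.9, "word_probs": [0.9, 0.9]},
    ]
    print(select_for_gemini(manifest, de))
```

Keeping the skip reason next to each id makes it easy to see how many clips were dropped for weak confidence versus a missing DE row.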
python 04b_merge_sentence_ch.py \
/path/to/segments/manifest.jsonl \
/path/to/segments/manifest_sentence_ch.jsonl \
--output /path/to/segments/manifest_with_sentence_ch.jsonl

python 05_merge_annotations.py \
--input /path/to/segments/manifest.jsonl \
--transcript-de /path/to/segments/manifest_transcript_de.jsonl \
--sentence-ch /path/to/segments/manifest_sentence_ch.jsonl \
--ser /path/to/segments/manifest_ser.jsonl \
--apr /path/to/segments/manifest_apr.jsonl \
--dialect /path/to/segments/manifest_dialect.jsonl \
--output /path/to/segments/manifest_annotated.jsonl

python 06_create_final_manifest.py \
--input /path/to/segments/manifest_annotated.jsonl \
--output /path/to/segments/manifest_final.jsonl

06_create_final_manifest.py warns, but still writes rows, when optional
emotion_frames or audio_tag_frames are missing. Missing emotion becomes
UNKNOWN; missing audio tags become an empty tags list.
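The fallback rule can be illustrated with a short sketch. The output keys emotion and tags, the per-frame "label" key, the "id" field, and the majority-vote reduction are assumptions made for this example. Only emotion_frames, audio_tag_frames, and the UNKNOWN default come from the description above.

```python
import logging
from collections import Counter

logger = logging.getLogger("final_manifest_sketch")


def compact_row(row):
    """Build a compact row; warn and fall back when optional frames are missing."""
    clip_id = row.get("id")

    emotion_frames = row.get("emotion_frames")
    if emotion_frames:
        # Assumption: reduce per-frame labels to one emotion by majority vote.
        counts = Counter(frame["label"] for frame in emotion_frames)
        emotion = counts.most_common(1)[0][0]
    else:
        logger.warning("%s: emotion_frames missing, using UNKNOWN", clip_id)
        emotion = "UNKNOWN"

    audio_tag_frames = row.get("audio_tag_frames")
    if audio_tag_frames:
        # Assumption: keep the distinct labels seen across all frames.
        tags = sorted({frame["label"] for frame in audio_tag_frames})
    else:
        logger.warning("%s: audio_tag_frames missing, using empty tags", clip_id)
        tags = []

    return {"id": clip_id, "emotion": emotion, "tags": tags}


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(compact_row({"id": "clip_0001"}))
```

The row is always returned, so a clip with missing optional annotations still ends up in the final manifest.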