This project implements a robust speech feature extraction pipeline for automated stutter detection, focusing on prosodic and acoustic features. It utilizes the SEP-28k dataset and extracts fine-grained frame-level and clip-level features to distinguish between fluent and stuttered speech.
- The confusion matrix indicates balanced classification accuracy across both classes:
  - 84.7% of No Stutter samples were correctly identified.
  - 86.7% of Stutter samples were correctly identified.
- The ROC curve achieved an AUC of 0.915, showing excellent ability to distinguish between stutter and non-stutter speech.
- The prediction probability distribution shows:
  - Most No Stutter samples clustered near 0.
  - Most Stutter samples clustered near 1.
- This indicates the model makes confident and reliable predictions with limited ambiguity.
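
For readers who want to reproduce this kind of summary on their own predictions, the following is a minimal sketch of how per-class accuracy (the row-normalized diagonal of a confusion matrix) and ROC AUC can be computed. The function names and the toy labels and scores are illustrative assumptions and are not taken from the project or its results.

```python
import numpy as np


def per_class_recall(y_true, y_pred):
    """Fraction of each true class predicted correctly (binary labels 0/1)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # rows: true class, columns: predicted
    return cm.diagonal() / cm.sum(axis=1)


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    y_true = np.asarray(y_true, dtype=int)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


# Toy data for illustration only (0 = No Stutter, 1 = Stutter).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7])
y_pred = (scores >= 0.5).astype(int)  # assumed decision threshold of 0.5

recall = per_class_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC above is quadratic in the number of samples, which is fine for a sanity check but not for large evaluation sets.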
To set up the project environment, run the following commands to download the dataset and pre-extracted features.
Requires the Kaggle CLI configured with your API key.

    mkdir -p dataset
    kaggle datasets download -d vudominhgiang/sep-28k-maintained -p dataset/
    unzip dataset/sep-28k-maintained.zip -d dataset/
    rm dataset/sep-28k-maintained.zip

Requires huggingface-cli.

    mkdir -p output
    huggingface-cli download bropal/stutter_detection_prosody --local-dir output/ --repo-type space

Manual downloads are available at:
The pipeline extracts several layers of speech features based on prosodic dynamics and spectral characteristics:
Vowel Onset Point (VOP) detection is a faithful implementation of Mary & Yegnanarayana (2008), using:
- LP Residual & Hilbert Envelope.
- Gabor filter convolution for evidence enhancement.
- Peak picking with dynamic thresholds and F0-based spurious reduction.
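
The steps listed above can be sketched roughly as follows. This is a simplified illustration, not the project's code: it assumes NumPy and SciPy, computes a single set of LP coefficients for the whole signal instead of frame by frame, uses an odd-symmetric Gabor kernel whose shape and width are assumptions, replaces the dynamic threshold with a fixed fraction of the maximum evidence, and omits the F0-based spurious reduction. All function names and default parameter values are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import find_peaks, hilbert, lfilter


def lp_residual(x, order=10):
    """Inverse-filter x with LP coefficients from the autocorrelation method.

    Simplification: one coefficient set for the whole signal.
    """
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    r[0] *= 1.0 + 1e-9  # slight regularization for numerical stability
    a = solve_toeplitz(r[:order], r[1:])  # predictor coefficients a_1..a_p
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)


def gabor_kernel(sigma):
    """Gaussian-windowed sine, signed so a rising envelope gives a positive peak.

    Assumed kernel shape for illustration; sigma is in samples.
    """
    n = np.arange(-3 * sigma, 3 * sigma + 1)
    return -np.exp(-(n ** 2) / (2.0 * sigma ** 2)) * np.sin(n / sigma)


def detect_vops(x, sr, order=10, sigma_s=0.0125, rel_height=0.5, min_gap_s=0.05):
    """Return candidate VOP sample indices and the evidence contour."""
    residual = lp_residual(x, order)
    envelope = np.abs(hilbert(residual))  # Hilbert envelope of the LP residual
    sigma = max(1, int(round(sigma_s * sr)))
    evidence = np.convolve(envelope, gabor_kernel(sigma), mode="same")
    peaks, _ = find_peaks(
        evidence,
        height=rel_height * evidence.max(),  # fixed fraction, not a dynamic threshold
        distance=max(1, int(min_gap_s * sr)),  # minimum spacing between VOPs
    )
    return peaks, evidence
```

Calling `detect_vops(signal, sr)` on a mono waveform returns peak positions in samples; dividing by `sr` gives onset times in seconds. On real speech, the threshold and minimum gap would need tuning, and the omitted F0-based check is what removes peaks in unvoiced regions.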
Extracts prosodic dynamics between VOPs:
- Duration: Syllable duration and voiced duration.
- Intonation: Peak F0, distance of peak from VOP ($D_p$), and F0 range ($\Delta F_0$).
- Tilt: Amplitude tilt and duration tilt parameters.
- Stress: Delta Log Energy.
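
As an illustration of the duration, intonation, and tilt features listed above, here is a minimal sketch that derives them from the F0 contour between two consecutive VOPs. It assumes the contour is sampled at a fixed hop with 0 marking unvoiced frames, and it uses the common tilt definitions $(|A_r| - |A_f|)/(|A_r| + |A_f|)$ and $(D_r - D_f)/(D_r + D_f)$ for rise and fall amplitudes and durations. The project's exact definitions may differ, the function name is hypothetical, and the stress feature is left out.

```python
import numpy as np


def syllable_prosody(f0, hop_s):
    """Prosodic features for one syllable-like segment starting at a VOP.

    f0: F0 values in Hz per frame, 0 for unvoiced frames (assumed convention).
    hop_s: frame hop in seconds.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = np.flatnonzero(f0 > 0)
    feats = {
        "syllable_duration": len(f0) * hop_s,
        "voiced_duration": len(voiced) * hop_s,
    }
    if len(voiced) == 0:
        return feats  # no intonation features without voiced frames

    first, last = voiced[0], voiced[-1]
    peak = voiced[np.argmax(f0[voiced])]

    a_rise = abs(f0[peak] - f0[first])  # rise amplitude A_r
    a_fall = abs(f0[peak] - f0[last])   # fall amplitude A_f
    d_rise = (peak - first) * hop_s     # rise duration D_r
    d_fall = (last - peak) * hop_s      # fall duration D_f

    def tilt(rise, fall):
        total = rise + fall
        return (rise - fall) / total if total > 0 else 0.0

    feats.update(
        peak_f0=f0[peak],
        d_p=peak * hop_s,  # distance of the F0 peak from the VOP (frame 0)
        delta_f0=f0[voiced].max() - f0[voiced].min(),
        amplitude_tilt=tilt(a_rise, a_fall),
        duration_tilt=tilt(d_rise, d_fall),
    )
    return feats
```

Both tilt values lie in [-1, 1]: positive values indicate a contour dominated by the rise, negative values one dominated by the fall.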
- MFCCs: 13 coefficients + $\Delta$ + $\Delta\Delta$ (39 dims).
- Voice Quality: Jitter, Shimmer, CPP (Cepstral Peak Prominence).
- Prosody Contours: F0 (RAPT-inspired autocorrelation), RMS Energy, Zero-Crossing Rate.
- Pause Features: Silence duration, pause count, and max pause length (targeting 'Block' stutters).
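
To make the pause features concrete, the sketch below derives silence duration, pause count, and maximum pause length from a frame-level RMS energy contour. It is an assumption-laden illustration rather than the project's implementation: the silence threshold (a fraction of the peak RMS), the minimum pause length, the frame sizes, and the function names are all placeholder choices.

```python
import numpy as np


def frame_rms(y, frame_len=400, hop=160):
    """RMS energy per frame (no padding; trailing samples are dropped)."""
    y = np.asarray(y, dtype=float)
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    return np.array(
        [np.sqrt(np.mean(y[i * hop : i * hop + frame_len] ** 2)) for i in range(n_frames)]
    )


def pause_features(rms, hop_s, rel_threshold=0.1, min_pause_s=0.1):
    """Silence and pause statistics from an RMS contour.

    A frame is silent if its RMS is below rel_threshold * max(rms);
    a pause is a run of silent frames lasting at least min_pause_s.
    """
    rms = np.asarray(rms, dtype=float)
    silent = rms < rel_threshold * rms.max()

    # Locate runs of consecutive silent frames via rising and falling edges.
    padded = np.concatenate(([False], silent, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    durations = (ends - starts) * hop_s
    pauses = durations[durations >= min_pause_s]

    return {
        "silence_duration": float(silent.sum() * hop_s),
        "pause_count": int(len(pauses)),
        "max_pause": float(pauses.max()) if len(pauses) else 0.0,
    }
```

A typical call would be `pause_features(frame_rms(y), hop_s=160 / sr)`. A long `max_pause` inside an utterance is the kind of cue that could point to a block, though the threshold values would need tuning against labeled clips.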
This project is licensed under the GPL-3.0 License.
