Stutter Detection Using Prosody

This project implements a speech feature extraction pipeline for automated stutter detection, built on prosodic and acoustic features. It uses the SEP-28k dataset and extracts frame-level and clip-level features to distinguish fluent from stuttered speech.

Results

The confusion matrix shows similar per-class accuracy (recall) for both classes:
- 84.7% of No Stutter samples were correctly identified.
- 86.7% of Stutter samples were correctly identified.

The ROC curve has an AUC of 0.915, indicating strong separation between stuttered and fluent speech.

The prediction probability distribution shows:
- Most No Stutter samples cluster near 0.
- Most Stutter samples cluster near 1.

Few predictions fall in the ambiguous middle range, so the model is confident on most samples.
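
For readers who want to recompute these metrics from their own predictions, here is a minimal sketch of the arithmetic. The counts and scores below are hypothetical and chosen only for illustration; they are not this project's results, and the function names are not taken from the notebooks.

```python
from typing import Sequence


def per_class_recall(confusion: Sequence[Sequence[int]]) -> list[float]:
    """Row-normalise a confusion matrix (rows = true class) to get recall per class."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a sanity check but slow for large evaluation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts, chosen only to illustrate the arithmetic.
confusion = [[85, 15],   # true No Stutter: 85 correct, 15 misclassified
             [10, 90]]   # true Stutter:    10 misclassified, 90 correct
print(per_class_recall(confusion))  # [0.85, 0.9]

# Hypothetical labels (1 = Stutter) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
print(roc_auc(labels, scores))  # 0.75
```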

Quick Start: Data Acquisition

To set up the project environment, run the following commands to download the dataset and pre-extracted features.

1. Download SEP-28k Dataset (Kaggle)

Requires the Kaggle CLI, configured with your API key.

mkdir -p dataset
kaggle datasets download -d vudominhgiang/sep-28k-maintained -p dataset/
unzip dataset/sep-28k-maintained.zip -d dataset/
rm dataset/sep-28k-maintained.zip

2. Download Extracted Features (Hugging Face)

Requires huggingface-cli.

mkdir -p output
huggingface-cli download bropal/stutter_detection_prosody --local-dir output/ --repo-type space

Manual downloads are available at:
- Kaggle: SEP-28k Maintained
- Hugging Face: Stutter Detection Prosody (Output)
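
After either download route, a quick check that both folders exist and contain files can save a failed notebook run. The sketch below only counts files by extension; it assumes nothing about the internal layout of `dataset/` or `output/`, and `count_by_suffix` is a hypothetical helper, not part of this repository.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root: str) -> Counter:
    """Count files under `root` (recursively) by lowercase file extension."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())


# Folder names match the commands in the Quick Start steps.
for folder in ("dataset", "output"):
    if Path(folder).is_dir():
        print(folder, dict(count_by_suffix(folder)))
    else:
        print(folder, "is missing; rerun the download step")
```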

Features & Methodology

The pipeline extracts several layers of speech features based on prosodic dynamics and spectral characteristics:

1. Vowel Onset Point (VOP) Detection

Implements the method of Mary & Yegnanarayana (2008), using:

- The LP (linear prediction) residual and its Hilbert envelope.
- Gabor filter convolution for evidence enhancement.
- Peak picking with dynamic thresholds and F0-based removal of spurious peaks.
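
The envelope, filtering, and peak-picking steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook code: it skips LP analysis and treats the input as if it were already the LP residual, uses a fixed relative threshold instead of dynamic thresholds, and omits the F0-based check. The kernel parameters (`sigma`, `omega`, `half_len`) and all function names are illustrative choices.

```python
import numpy as np


def hilbert_envelope(x: np.ndarray) -> np.ndarray:
    """Magnitude of the analytic signal, built by zeroing negative frequencies."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))


def gabor_kernel(sigma: float, omega: float, half_len: int) -> np.ndarray:
    """Gaussian-windowed sinusoid; sign chosen so rising edges give positive peaks."""
    k = np.arange(-half_len, half_len + 1)
    return -np.exp(-k**2 / (2.0 * sigma**2)) * np.sin(omega * k)


def vop_evidence(residual, sigma=100.0, omega=0.0114, half_len=400):
    """Convolve the Hilbert envelope with the kernel to emphasise onsets."""
    env = hilbert_envelope(np.asarray(residual, dtype=float))
    return np.convolve(env, gabor_kernel(sigma, omega, half_len), mode="same")


def pick_peaks(evidence, rel_threshold=0.5, min_gap=400):
    """Keep local maxima above rel_threshold * max, at least min_gap samples apart."""
    thr = rel_threshold * evidence.max()
    candidates = [i for i in range(1, len(evidence) - 1)
                  if evidence[i] >= thr
                  and evidence[i] > evidence[i - 1]
                  and evidence[i] >= evidence[i + 1]]
    kept = []
    # Visit the strongest candidates first so weaker neighbours are dropped.
    for i in sorted(candidates, key=lambda j: evidence[j], reverse=True):
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)


# Synthetic stand-in for a residual: silence, then a 500 Hz tone from sample 4000.
sr = 8000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 500 * t)])
print(pick_peaks(vop_evidence(signal)))  # onset candidates, in samples
```

On real speech the relative threshold and minimum gap would need tuning, which is where the dynamic thresholds and F0-based filtering listed above come in.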

2. Syllable Prosody (7 Parameters)

Extracts prosodic dynamics between VOPs:

- Duration: Syllable duration and voiced duration.
- Intonation: Peak F0, distance of peak from VOP ($D_p$), and F0 range ($\Delta F_0$).
- Tilt: Amplitude tilt and duration tilt parameters.
- Stress: Delta log energy.
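
To make the intonation and tilt parameters concrete, here is a small sketch that derives them from one syllable's F0 contour. It assumes the common tilt formulation, $(|rise| - |fall|) / (|rise| + |fall|)$, applied to both the F0 change and the duration on either side of the peak, and it assumes the contour starts at the VOP. Both are assumptions to check against the paper and the notebooks; the function and key names are hypothetical.

```python
from typing import Sequence


def tilt(rise: float, fall: float) -> float:
    """(|rise| - |fall|) / (|rise| + |fall|), defined as 0 when both are zero."""
    total = abs(rise) + abs(fall)
    return (abs(rise) - abs(fall)) / total if total else 0.0


def intonation_params(f0: Sequence[float], hop_s: float) -> dict:
    """f0: voiced F0 values (Hz) for one syllable, with the first frame at the VOP.

    hop_s is the spacing between F0 frames in seconds.
    """
    peak_idx = max(range(len(f0)), key=lambda i: f0[i])
    peak = f0[peak_idx]
    rise_dur = peak_idx * hop_s                   # VOP to F0 peak
    fall_dur = (len(f0) - 1 - peak_idx) * hop_s   # F0 peak to end of contour
    return {
        "peak_f0": peak,
        "d_p": rise_dur,
        "delta_f0": peak - min(f0),
        "amplitude_tilt": tilt(peak - f0[0], peak - f0[-1]),
        "duration_tilt": tilt(rise_dur, fall_dur),
    }


# Hypothetical contour sampled every 10 ms: rises for 30 ms, falls for 10 ms.
params = intonation_params([100.0, 110.0, 120.0, 130.0, 120.0], hop_s=0.01)
print(params)
```

Both tilt values lie in [-1, 1]: +1 means a pure rise, -1 a pure fall, and 0 a symmetric rise and fall.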

3. Acoustic & Spectral Features

- MFCCs: 13 coefficients + $\Delta$ + $\Delta\Delta$ (39 dims).
- Voice Quality: Jitter, Shimmer, CPP (Cepstral Peak Prominence).
- Prosody Contours: F0 (RAPT-inspired autocorrelation), RMS Energy, Zero-Crossing Rate.
- Pause Features: Silence duration, pause count, and max pause length (targeting 'Block' stutters).
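
As an illustration of how the pause features relate to the RMS energy contour, the sketch below marks frames as silent when their RMS falls under a threshold and then summarises runs of silent frames. The frame length, threshold, and minimum pause duration are illustrative assumptions rather than the values used in the notebooks, and silence duration is defined here as the total length of qualifying pauses.

```python
import numpy as np


def frame_rms(x: np.ndarray, frame_len: int) -> np.ndarray:
    """RMS energy of consecutive non-overlapping frames (trailing remainder dropped)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames**2, axis=1))


def pause_features(x, sr, frame_len=160, rms_threshold=0.01, min_pause_s=0.1):
    """Silence duration, pause count, and longest pause, all in seconds or counts."""
    silent = frame_rms(np.asarray(x, dtype=float), frame_len) < rms_threshold
    frame_s = frame_len / sr
    min_frames = max(1, round(min_pause_s * sr / frame_len))
    pauses, run = [], 0
    for is_silent in silent:
        if is_silent:
            run += 1
            continue
        if run >= min_frames:      # a silent run just ended; keep it if long enough
            pauses.append(run * frame_s)
        run = 0
    if run >= min_frames:          # the clip may end in silence
        pauses.append(run * frame_s)
    return {
        "silence_duration": float(sum(pauses)),
        "pause_count": len(pauses),
        "max_pause": max(pauses, default=0.0),
    }


def tone(n_samples: int, sr: int) -> np.ndarray:
    """Synthetic 440 Hz tone standing in for speech."""
    return 0.5 * np.sin(2 * np.pi * 440 * np.arange(n_samples) / sr)


# Synthetic clip at 16 kHz: 0.3 s tone, 0.2 s silence, 0.2 s tone, 0.3 s silence.
sr = 16000
clip = np.concatenate([tone(4800, sr), np.zeros(3200), tone(3200, sr), np.zeros(4800)])
print(pause_features(clip, sr))
```

A long `max_pause` inside an utterance is the kind of cue the pause features are meant to expose for 'Block' stutters, although a fixed energy threshold will also flag ordinary silences between words.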

This project is licensed under the GPL-3.0 License.
