bropal404/SSP_stutter_detection

Stutter Detection Using Prosody

This project implements a speech feature extraction pipeline for automated stutter detection based on prosodic and acoustic features. It uses the SEP-28k dataset and extracts frame-level and clip-level features to distinguish fluent from stuttered speech.

Results

  • The confusion matrix indicates balanced classification accuracy across both classes:

    • 84.7% of No Stutter samples were correctly identified.
    • 86.7% of Stutter samples were correctly identified.
  • The ROC curve has an AUC of 0.915, indicating strong separation between stuttered and non-stuttered speech.

  • The prediction probability distribution shows:

    • Most No Stutter samples clustered near 0.
    • Most Stutter samples clustered near 1.
  • This clustering near 0 and 1 indicates that the model assigns confident probabilities to most samples, with few falling in the ambiguous middle range.
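
For readers who want to reproduce figures of this kind, here is a minimal sketch of how per-class rates and an AUC can be computed from labels and predicted probabilities. The labels and scores are made-up toy values, not this project's data, and the function names `per_class_recall` and `roc_auc` are hypothetical helpers rather than code from the repository.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of each class predicted correctly (row-normalised confusion-matrix diagonal)."""
    recalls = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls[cls] = correct / len(idx)
    return recalls


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2) but easy to read.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 0 = No Stutter, 1 = Stutter (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.40, 0.70, 0.30, 0.80, 0.90, 0.95]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 decision threshold is an assumption

recalls = per_class_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(recalls, auc)
```

The per-class recall for class 0 corresponds to the "No Stutter correctly identified" rate, and for class 1 to the "Stutter correctly identified" rate.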

Quick Start: Data Acquisition

To set up the project environment, run the following commands to download the dataset and pre-extracted features.

1. Download SEP-28k Dataset (Kaggle)

Requires Kaggle CLI configured with your API key.

mkdir -p dataset
kaggle datasets download -d vudominhgiang/sep-28k-maintained -p dataset/
unzip dataset/sep-28k-maintained.zip -d dataset/
rm dataset/sep-28k-maintained.zip

2. Download Extracted Features (Hugging Face)

Requires huggingface-cli.

mkdir -p output
huggingface-cli download bropal/stutter_detection_prosody --local-dir output/ --repo-type space

Manual downloads are available at:


Features & Methodology

The pipeline extracts several layers of speech features based on prosodic dynamics and spectral characteristics:

1. Vowel Onset Point (VOP) Detection

Faithful implementation of Mary & Yegnanarayana (2008), using:

  • LP Residual & Hilbert Envelope.
  • Gabor filter convolution for evidence enhancement.
  • Peak picking with dynamic thresholds and F0-based spurious reduction.
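
The peak-picking step can be illustrated with a small sketch. It assumes a VOP evidence contour has already been computed (the LP residual, Hilbert envelope, and Gabor filtering stages are not shown), and it uses one simple form of dynamic threshold: a peak is kept only if it reaches a fraction of the largest evidence value in its neighbourhood. The function name, the parameter values, and the thresholding rule are illustrative assumptions, not the repository's implementation or the exact procedure of the cited paper. F0-based spurious reduction is not covered.

```python
def pick_peaks(evidence, window=5, rel_threshold=0.5, min_gap=3):
    """Return frame indices of candidate onsets in an evidence contour.

    evidence      : list of non-negative evidence values, one per frame
    window        : half-width (frames) of the neighbourhood used for the threshold
    rel_threshold : a peak must reach this fraction of the neighbourhood maximum
    min_gap       : minimum spacing (frames) between kept peaks
    """
    kept = []
    for i in range(1, len(evidence) - 1):
        # Local maximum test (>= on the left so flat tops yield one peak).
        if not (evidence[i] >= evidence[i - 1] and evidence[i] > evidence[i + 1]):
            continue
        # Dynamic threshold: compare against the strongest value nearby.
        lo, hi = max(0, i - window), min(len(evidence), i + window + 1)
        if evidence[i] < rel_threshold * max(evidence[lo:hi]):
            continue
        # Enforce a minimum gap, keeping the stronger of two close peaks.
        if kept and i - kept[-1] < min_gap:
            if evidence[i] > evidence[kept[-1]]:
                kept[-1] = i
            continue
        kept.append(i)
    return kept


# Toy contour: two strong peaks and one weak bump between them.
evidence = [0, 0.1, 0.9, 0.1, 0, 0, 0.05, 0.1, 0.05, 0, 0, 0.2, 1.0, 0.2, 0]
peaks = pick_peaks(evidence)
print(peaks)
```

In this toy contour the weak bump at frame 7 is rejected because it is small relative to the strong peaks within its neighbourhood, while the peaks at frames 2 and 12 are kept.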

2. Syllable Prosody (7 Parameters)

Extracts prosodic dynamics between VOPs:

  • Duration: Syllable duration and voiced duration.
  • Intonation: Peak F0, Distance of peak from VOP ($D_p$), and F0 range ($\Delta F_0$).
  • Tilt: Amplitude tilt and Duration tilt parameters.
  • Stress: Delta Log Energy.
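
As a rough sketch of how the quantities listed above could be derived for a single syllable-like segment between two VOPs, the example below works on a frame-level F0 contour and log-energy contour. The tilt parameters follow the common form $(|r| - |f|) / (|r| + |f|)$, where $r$ and $f$ are the rise and fall amplitudes (or durations) around the F0 peak. Taking Delta Log Energy as the range of log energy over voiced frames is an assumption made for illustration. The function name, dictionary keys, and 10 ms frame step are hypothetical and not taken from the repository.

```python
def syllable_prosody(f0, log_energy, frame_step_s=0.01):
    """Prosodic parameters for the frames between two consecutive VOPs.

    f0         : F0 per frame in Hz, with 0 marking unvoiced frames
    log_energy : log energy per frame (same length as f0)
    Returns None if the segment has no voiced frames.
    """
    voiced = [i for i, f in enumerate(f0) if f > 0]
    if not voiced:
        return None
    start, end = voiced[0], voiced[-1]
    peak = max(voiced, key=lambda i: f0[i])  # frame index of the F0 peak

    def tilt(rise, fall):
        total = abs(rise) + abs(fall)
        return (abs(rise) - abs(fall)) / total if total > 0 else 0.0

    f0_voiced = [f0[i] for i in voiced]
    energy_voiced = [log_energy[i] for i in voiced]
    return {
        "syllable_duration": len(f0) * frame_step_s,
        "voiced_duration": len(voiced) * frame_step_s,
        "peak_f0": f0[peak],
        "peak_distance": peak * frame_step_s,  # D_p, measured from the VOP
        "f0_range": max(f0_voiced) - min(f0_voiced),  # Delta F0
        "amplitude_tilt": tilt(f0[peak] - f0[start], f0[peak] - f0[end]),
        "duration_tilt": tilt((peak - start) * frame_step_s,
                              (end - peak) * frame_step_s),
        "delta_log_energy": max(energy_voiced) - min(energy_voiced),
    }


# Toy segment: one unvoiced frame on each side of a rise-fall F0 contour.
f0 = [0, 100, 120, 150, 130, 110, 0]
log_energy = [1.0, 2.0, 3.0, 4.0, 3.5, 2.5, 1.0]
params = syllable_prosody(f0, log_energy)
print(params)
```

A tilt of +1 means the contour only rises, -1 means it only falls, and 0 means a symmetric rise and fall.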

3. Acoustic & Spectral Features

  • MFCCs: 13 coefficients + $\Delta$ + $\Delta\Delta$ (39 dims).
  • Voice Quality: Jitter, Shimmer, CPP (Cepstral Peak Prominence).
  • Prosody Contours: F0 (RAPT-inspired autocorrelation), RMS Energy, Zero-Crossing Rate.
  • Pause Features: Silence duration, pause count, and max pause length (targeting 'Block' stutters).
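
To make the pause features concrete, here is a minimal sketch that derives silence duration, pause count, and maximum pause length from frame-level RMS energy. It assumes a fixed energy threshold separates silence from speech, which is a simplification. The threshold value, frame sizes, and function names are illustrative assumptions rather than the pipeline's actual settings.

```python
import math


def frame_rms(samples, frame_len, hop):
    """RMS energy of each frame of `frame_len` samples, advancing by `hop` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in range(0, len(samples) - frame_len + 1, hop)
    ]


def pause_features(rms, hop_s, silence_threshold, min_pause_s=0.0):
    """Summarise runs of consecutive low-energy frames as pauses.

    rms               : per-frame RMS values
    hop_s             : time between frame starts, in seconds
    silence_threshold : frames with RMS below this count as silent
    min_pause_s       : shorter silent runs are ignored
    """
    runs, run = [], 0
    for value in rms:
        if value < silence_threshold:
            run += 1
        else:
            if run:
                runs.append(run)
            run = 0
    if run:  # a pause that lasts until the end of the clip
        runs.append(run)
    durations = [r * hop_s for r in runs if r * hop_s >= min_pause_s]
    return {
        "silence_duration": sum(durations),
        "pause_count": len(durations),
        "max_pause": max(durations, default=0.0),
    }


# Toy signal: speech, a long pause, speech, a short trailing pause.
samples = [1.0] * 4 + [0.0] * 8 + [1.0] * 4 + [0.0] * 4
rms = frame_rms(samples, frame_len=4, hop=4)
pauses = pause_features(rms, hop_s=0.01, silence_threshold=0.1)
print(rms, pauses)
```

Long values of `max_pause` inside an utterance are the kind of cue the pause features are meant to capture for 'Block' stutters.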

This project is licensed under the GPL-3.0 License.

The feature extraction is implemented without using Praat.
