This repository extends the work from "Disfluency Detection from Untranscribed Speech" by Kumar et al., which introduces a multimodal framework for detecting disfluencies using both audio and text signals.
Disfluency detection plays an important role in studying conversational patterns, speech disorders, hesitations, and natural dialogue modeling.
FluencyBank contains detailed manual annotations of several disfluency types at the word level, making it suitable for fine-grained token classification tasks.
We follow the same five disfluency categories defined in the original paper:
- FP (Filled Pause): Hesitations such as “uh” and “um”.
- RP (Repetition): Immediate word or phrase repetitions (e.g., “I I think…”).
- RV (Revision): Corrections or alterations to previous speech (e.g., “Take the red- uh, blue one”).
- RS (Restart): Sentence restarts where the speaker abandons an initial structure.
- PW (Partial Word): Word fragments or cut-off beginnings (e.g., “go- going”).
Although the model supports all five labels, the FluencyBank test set contains only four of them: FP, RP, RV, and PW.
The RS (restart) class has no positive examples in this dataset split, so its scores are zero. The per-class tables below omit RS, although the reported macro averages are consistent with it being counted as a zero.
The original project provides a BERT-based text model trained on the Switchboard corpus.
🚀 Our Contribution: In this work, we adapt and fine-tune this text model on the FluencyBank dataset to improve cross-domain generalization.
git clone https://github.com/HoseinRanjbar/disfluency_detection.git

The model is initialized using the publicly released Switchboard-trained language model weights, ensuring continuity with the original methodology.
!gdown --id 1GQIXgCSF3Usiuy5hkxgOl483RPX3f_SX -O checkpoints/language.pt

- Unzip the dataset
- Mapped word-level labels to BERT subwords
- Created speaker-independent train/validation/test splits
- Handled variable-length segments and alignment issues
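
The label-mapping step listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `toy_subword_split` is a hypothetical stand-in for a real WordPiece tokenizer, and copying each word's labels to all of its subwords is an assumption (another common choice is to label only the first subword and mask the rest).

```python
from typing import Callable, List, Tuple

# Label order used in this document; a word may carry several labels at once.
LABELS = ["FP", "RP", "RV", "RS", "PW"]


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: chop a word into
    fixed-size pieces and mark continuation pieces with '##'."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_labels(
    words: List[str],
    word_labels: List[List[int]],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Expand word-level multi-hot labels to subword level.

    Returns the subword tokens, one label vector per subword, and the index
    of the source word for each subword (useful for pooling predictions
    back to words at evaluation time).
    """
    if len(words) != len(word_labels):
        raise ValueError("words and word_labels must have the same length")
    tokens, token_labels, word_ids = [], [], []
    for idx, (word, labels) in enumerate(zip(words, word_labels)):
        for piece in split(word):
            tokens.append(piece)
            token_labels.append(list(labels))  # assumption: copy to every piece
            word_ids.append(idx)
    return tokens, token_labels, word_ids


# Toy segment: "um" is a filled pause, the other words are fluent.
words = ["um", "going", "home"]
word_labels = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
tokens, token_labels, word_ids = align_labels(words, word_labels)
```

Keeping `word_ids` alongside the expanded labels makes it possible to collapse subword predictions back to one decision per word, which is the level at which FluencyBank is annotated.
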
!unzip ./data/FluencyBank_TimeStamped.zip -d ./data/
!python utils/split_dataset.py \
--metadata_path data/FluencyBank_TimeStamped/metadata.csv \
--train_ratio 0.80 \
--test_ratio 0.10 \
--val_ratio 0.10

This creates:
- data/split/train_metadata.csv
- data/split/val_metadata.csv
- data/split/test_metadata.csv

Training design:
- BCEWithLogitsLoss for multi-label prediction
- Step-based evaluation (every 100 training steps)
- Model selection based on Unweighted Average Recall (UAR)
- Early stopping using patience in evaluation steps
- Final checkpoint selected using best dev UAR
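
The step-based evaluation and patience logic described above can be sketched as a small helper. This is an illustrative sketch only: the `EarlyStopper` class and its method names are hypothetical, the dev UAR values are made up, and the actual logic in `train.py` may differ. The two settings mirror the `--eval_every` and `--patience_evals` flags used by the training command.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Track the best dev UAR and stop after `patience_evals` evaluations
    without improvement (hypothetical helper, not the repository's class)."""
    patience_evals: int = 10
    best_uar: float = float("-inf")
    best_step: int = -1
    evals_without_improvement: int = 0

    def update(self, step: int, dev_uar: float) -> bool:
        """Record one evaluation; return True when it is a new best,
        which is the point at which a checkpoint would be saved."""
        if dev_uar > self.best_uar:
            self.best_uar = dev_uar
            self.best_step = step
            self.evals_without_improvement = 0
            return True
        self.evals_without_improvement += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.evals_without_improvement >= self.patience_evals


# Simulated dev UAR values, one per evaluation (made-up numbers).
eval_every = 100
dev_uar_history = [0.60, 0.65, 0.64, 0.63, 0.66, 0.65, 0.64]

stopper = EarlyStopper(patience_evals=2)
step = 0
for i, dev_uar in enumerate(dev_uar_history, start=1):
    step = i * eval_every
    improved = stopper.update(step, dev_uar)  # save the checkpoint if True
    if stopper.should_stop:
        break
```

With a patience of 2, this toy run stops at step 400 and keeps the step-200 checkpoint, never reaching the later 0.66. A larger patience, such as the 10 used in the command, trades extra training time for a lower chance of stopping on a temporary dip.
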
!python text_based_model/train.py \
--train_metadata_path data/split/train_metadata.csv \
--val_metadata_path data/split/val_metadata.csv \
--word_dir data/FluencyBank_TimeStamped/csvs/csvs \
--output_weights checkpoints/language_fluencybank.pt \
--init_weights checkpoints/language.pt \
--batch_size 16 \
--lr 5e-5 \
--eval_every 100 \
--patience_evals 10

📥 Download the Fine-Tuned Model Weights
from huggingface_hub import hf_hub_download
checkpoint_path = hf_hub_download(
repo_id="HosseinRanjbar/disfluency_detection",
filename="language_model_fluencybank.pth"
)

On the held-out FluencyBank test set, we compute:
- Precision
- Recall
- F1 per disfluency category
- Macro-F1
- UAR (Unweighted Average Recall)
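
For reference, the metrics listed above can be computed from binary per-class predictions as sketched below. This is a plain-Python illustration under stated assumptions, not the evaluation code in `test.py`: the function names are hypothetical, the data is a toy example, and a class with no positive examples is scored as zero, in line with how RS is described earlier.

```python
from typing import Dict, List


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def per_class_metrics(
    y_true: List[List[int]], y_pred: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for each label column of multi-hot rows."""
    results = {}
    for j, name in enumerate(labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1 = safe_div(2 * precision * recall, precision + recall)
        results[name] = {"precision": precision, "recall": recall, "f1": f1}
    return results


def macro_average(results: Dict[str, Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric over all classes.
    With key='recall' this is the UAR."""
    return sum(m[key] for m in results.values()) / len(results)


# Toy example: four tokens, two labels (made-up data).
labels = ["FP", "RP"]
y_true = [[1, 0], [0, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 0], [1, 0]]

metrics = per_class_metrics(y_true, y_pred, labels)
uar = macro_average(metrics, "recall")
macro_f1 = macro_average(metrics, "f1")
```

Because the average is unweighted, a rare class such as RV moves UAR and Macro-F1 as much as a frequent one, which is why these metrics are preferred over accuracy on imbalanced label sets.
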
!python text_based_model/test.py \
--metadata_path data/split/test_metadata.csv \
--word_dir data/FluencyBank_TimeStamped/csvs/csvs \
--weights_path checkpoints/language_fluencybank.pt \
--batch_size 16

Fine-tuning on FluencyBank improved F1 for every class in the table below, most notably RP (0.7241 to 0.9132) and PW (0.7963 to 0.9327). RV remains the hardest class: its precision rose, but its recall fell from 0.6258 to 0.4233.
| Class | Precision (Before) | Recall (Before) | F1 (Before) | Precision (After) | Recall (After) | F1 (After) |
|---|---|---|---|---|---|---|
| FP | 0.9726 | 1.0000 | 0.9861 | 0.9930 | 1.0000 | 0.9965 |
| RP | 0.9545 | 0.5833 | 0.7241 | 0.8976 | 0.9293 | 0.9132 |
| RV | 0.3197 | 0.6258 | 0.4232 | 0.5798 | 0.4233 | 0.4894 |
| PW | 0.8000 | 0.7926 | 0.7963 | 0.9393 | 0.9263 | 0.9327 |
| ND | 0.9821 | 0.9739 | 0.9780 | 0.9819 | 0.9885 | 0.9852 |

| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Macro Recall | 0.6626 | 0.7112 |
| Macro F1 | 0.6513 | 0.7195 |
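
The summary rows can be reproduced from the per-class table, which also shows how the empty RS class appears to be treated. This is our reading of the numbers, not a statement taken from the evaluation code: averaging only the five listed rows gives a macro recall of about 0.795 before fine-tuning, whereas dividing by six classes, with RS contributing zero, gives the reported 0.6626.

```python
# Per-class scores copied from the text-model table above.
recall_before = {"FP": 1.0000, "RP": 0.5833, "RV": 0.6258, "PW": 0.7926, "ND": 0.9739}
recall_after = {"FP": 1.0000, "RP": 0.9293, "RV": 0.4233, "PW": 0.9263, "ND": 0.9885}
f1_before = {"FP": 0.9861, "RP": 0.7241, "RV": 0.4232, "PW": 0.7963, "ND": 0.9780}
f1_after = {"FP": 0.9965, "RP": 0.9132, "RV": 0.4894, "PW": 0.9327, "ND": 0.9852}


def macro_with_empty_classes(scores: dict, n_empty: int = 1) -> float:
    """Macro average in which `n_empty` extra classes (here RS, which has
    no positive test examples) are assumed to score zero."""
    return sum(scores.values()) / (len(scores) + n_empty)


summary = {
    "macro_recall_before": round(macro_with_empty_classes(recall_before), 4),
    "macro_recall_after": round(macro_with_empty_classes(recall_after), 4),
    "macro_f1_before": round(macro_with_empty_classes(f1_before), 4),
    "macro_f1_after": round(macro_with_empty_classes(f1_after), 4),
}

# Average over the five listed rows only, for comparison.
five_class_recall_before = sum(recall_before.values()) / len(recall_before)
```

Under this assumption all four summary values match the table, so the macro scores understate performance on the classes that actually occur in the test split.
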
Disfluency annotations are available only for the FluencyBank English Voices–AWS Corpus, which is open-source. The original dataset provides interview and reading videos of individuals who stutter. We downloaded the source files, converted them to .wav, and uploaded the processed audio to our Hugging Face dataset.
You can download the audio files using the command below:
!gdown https://huggingface.co/datasets/HosseinRanjbar/FluencyBank/resolve/main/voice.zip

The model is initialized using the publicly released Switchboard-trained acoustic model weights, ensuring continuity with the original methodology.
!gdown --id 1wWrmopvvdhlBw-cL7EDyih9zn_IJu5Wr -O checkpoints/acoustic.pt

!python /acoustic_based_model/demo.py --audio_path /voice-example/24fb.wav \
--metadata_path data/split/test_metadata.csv \
--word_dir data/FluencyBank_TimeStamped/csvs/csvs \
--weights_path checkpoints/language_fluencybank.pt

Training design:
- BCEWithLogitsLoss for multi-label prediction
- Step-based evaluation (every N training steps)
- Model selection based on UAR
- Early stopping using patience in evaluation steps
- Final checkpoint selected using best dev UAR
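
To make the multi-label objective concrete, here is a plain-Python sketch of the quantity that `torch.nn.BCEWithLogitsLoss` computes with its default mean reduction (a sigmoid followed by binary cross-entropy, written in its numerically stable form), together with independent thresholding at inference time. The logits and targets are made-up values, the helper names are hypothetical, and the 0.5 threshold matches the `--threshold` flag used by the acoustic test command.

```python
import math
from typing import List

LABELS = ["FP", "RP", "RV", "RS", "PW"]


def bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Mean binary cross-entropy on raw logits, using the stable identity
    loss = max(x, 0) - x * y + log(1 + exp(-|x|))."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def sigmoid(x: float) -> float:
    # Adequate for a sketch; very large negative inputs would overflow.
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits: List[float], threshold: float = 0.5) -> List[str]:
    """Each label is decided independently, so one input can receive
    zero, one, or several labels."""
    return [name for name, x in zip(LABELS, logits) if sigmoid(x) >= threshold]


logits = [2.0, -1.0, 0.3, -3.0, -0.2]  # made-up model outputs
targets = [1.0, 0.0, 1.0, 0.0, 0.0]    # FP and RV are present
loss = bce_with_logits(logits, targets)
predicted = predict_labels(logits)
```

Treating each label as its own binary decision, rather than applying a softmax over classes, is what allows a single stretch of speech to be tagged as, say, both a filled pause and part of a revision.
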
!python acoustic_model/train.py \
--train_metadata_path data/split/train_metadata.csv \
--dev_metadata_path data/split/val_metadata.csv \
--audio_dir data/FluencyBank_Wav \
--word_dir data/FluencyBank_TimeStamped/csvs/csvs \
--output_weights checkpoints/acoustic_fluencybank.pt \
--num_epochs 15 \
--lr 1e-5 \
--patience_epochs 5 \
--device cuda

📥 Download the Fine-Tuned Model Weights
from huggingface_hub import hf_hub_download
checkpoint_path = hf_hub_download(
repo_id="HosseinRanjbar/disfluency_detection",
filename="acoustic_model_ff.pth"
)

!python acoustic_based_model/test.py \
--audio_dir data/FluencyBank_Wav \
--metadata_path data/split/test_metadata.csv \
--word_dir data/FluencyBank_TimeStamped/csvs/csvs \
--weights_path checkpoints/acoustic_fluencybank.pt \
--device cuda \
--threshold 0.5

| Class | Precision (Before) | Recall (Before) | F1 (Before) | Precision (After) | Recall (After) | F1 (After) |
|---|---|---|---|---|---|---|
| FP | 0.4686 | 0.7246 | 0.5691 | 0.4991 | 0.7939 | 0.6129 |
| RP | 0.2766 | 0.3132 | 0.2938 | 0.2674 | 0.5088 | 0.3505 |
| RV | 0.0947 | 0.4874 | 0.1586 | 0.2144 | 0.1685 | 0.1887 |
| PW | 0.1846 | 0.3336 | 0.2377 | 0.1695 | 0.4160 | 0.2409 |

| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| UAR | 0.3718 | 0.3774 |
| Macro F1 | 0.2518 | 0.2786 |
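
As a quick arithmetic check on the acoustic summary, the snippet below recomputes UAR and Macro F1 from the per-class rows. This is our reading of the numbers rather than a statement from the evaluation code: the reported values match an average over five classes, that is, the four listed classes plus RS counted as zero.

```python
# Per-class scores copied from the acoustic table above (FP, RP, RV, PW).
recall = {
    "before": [0.7246, 0.3132, 0.4874, 0.3336],
    "after": [0.7939, 0.5088, 0.1685, 0.4160],
}
f1 = {
    "before": [0.5691, 0.2938, 0.1586, 0.2377],
    "after": [0.6129, 0.3505, 0.1887, 0.2409],
}

# Assumption: RS has no positive test examples and contributes a zero.
NUM_CLASSES = 5

uar = {k: round(sum(v) / NUM_CLASSES, 4) for k, v in recall.items()}
macro_f1 = {k: round(sum(v) / NUM_CLASSES, 4) for k, v in f1.items()}
```

Read this way, UAR barely moves after fine-tuning (0.3718 to 0.3774) because the drop in RV recall, from 0.4874 to 0.1685, offsets most of the recall gains on FP, RP, and PW.
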