Summary
The MLX Whisper decode_with_fallback logic appears to differ from upstream openai/whisper in how no_speech_threshold suppresses fallback decoding.
In upstream OpenAI Whisper, high no_speech_prob only disables fallback when the average log probability is also below logprob_threshold:
    if (
        no_speech_threshold is not None
        and decode_result.no_speech_prob > no_speech_threshold
        and logprob_threshold is not None
        and decode_result.avg_logprob < logprob_threshold
    ):
        needs_fallback = False  # silence
In mlx_whisper.transcribe, the condition is broader:
    if (
        no_speech_threshold is not None
        and decode_result.no_speech_prob > no_speech_threshold
    ):
        needs_fallback = False  # silence
Why this matters
As a result, MLX Whisper can accept a decode as “silence” solely because no_speech_prob is high, even when fallback was triggered for a different reason, such as a high compression ratio (repetitive output).
The upstream behavior is narrower: it suppresses fallback only when the segment looks like silence on both counts, a high no-speech probability and a low average log probability. The difference may change how often hallucinated or repetitive output is accepted instead of being re-decoded.
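The difference can be illustrated with a small standalone sketch. It does not import either library: FakeDecodeResult and needs_fallback are hypothetical names introduced here, the two fallback triggers before the silence check are modelled on the compression-ratio and log-probability checks mentioned in this report, and the threshold defaults (2.4, -1.0, 0.6) are assumed values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDecodeResult:
    # Hypothetical stand-in holding only the fields the checks read.
    avg_logprob: float
    no_speech_prob: float
    compression_ratio: float


def needs_fallback(
    result,
    *,
    require_low_logprob,
    compression_ratio_threshold=2.4,  # assumed default
    logprob_threshold=-1.0,           # assumed default
    no_speech_threshold=0.6,          # assumed default
):
    """Return True if the decode should be retried.

    require_low_logprob=True mimics the narrower upstream condition;
    require_low_logprob=False mimics the broader condition quoted above.
    """
    needs = False
    if (
        compression_ratio_threshold is not None
        and result.compression_ratio > compression_ratio_threshold
    ):
        needs = True  # output is too repetitive
    if logprob_threshold is not None and result.avg_logprob < logprob_threshold:
        needs = True  # average log probability is too low

    silence = (
        no_speech_threshold is not None
        and result.no_speech_prob > no_speech_threshold
    )
    if require_low_logprob:
        silence = (
            silence
            and logprob_threshold is not None
            and result.avg_logprob < logprob_threshold
        )
    if silence:
        needs = False  # treat the segment as silence
    return needs


# Repetitive output with a confident log probability and a high no_speech_prob.
repetitive = FakeDecodeResult(avg_logprob=-0.3, no_speech_prob=0.9, compression_ratio=3.1)
print(needs_fallback(repetitive, require_low_logprob=False))  # broader check
print(needs_fallback(repetitive, require_low_logprob=True))   # narrower check

# A segment that looks like silence on both counts.
silent = FakeDecodeResult(avg_logprob=-1.5, no_speech_prob=0.9, compression_ratio=1.0)
print(needs_fallback(silent, require_low_logprob=False))
print(needs_fallback(silent, require_low_logprob=True))
```

Under these assumptions the two variants agree on the silent segment and disagree only on the repetitive one: the broader check cancels the fallback that the compression ratio requested, while the narrower check keeps it.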
References
OpenAI Whisper source:
https://github.com/openai/whisper/blob/main/whisper/transcribe.py
Relevant upstream condition is in decode_with_fallback.
MLX Whisper source:
whisper/transcribe.py, inside decode_with_fallback.
Suggested change
Match upstream OpenAI Whisper by adding the logprob_threshold and avg_logprob checks:
    if (
        no_speech_threshold is not None
        and decode_result.no_speech_prob > no_speech_threshold
        and logprob_threshold is not None
        and decode_result.avg_logprob < logprob_threshold
    ):
        needs_fallback = False  # silence