audio tester class #45391

Merged
eustlb merged 42 commits into main from tarekziade-audio-test
May 11, 2026

@tarekziade

Collaborator

What does this PR do?

Similarly to the VLM tester, this patch introduces an audio tester class, used in:

  • Qwen2Audio
  • AudioFlamingo3
  • GraniteSpeech

Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
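
To illustrate the kind of saving described above, here is a minimal, torch-free sketch of the pattern. It is not the code from this PR: `AudioTesterSketch`, `ToyAudioModelTester`, `prepare_inputs`, and all default values are hypothetical names chosen for illustration. Only `audio_config_key`, `get_audio_config`, and `get_num_audio_tokens` echo identifiers that appear in the review snippets.

```python
class AudioTesterSketch:
    """Hypothetical shared base: holds the boilerplate every audio tester needs."""

    config_class = dict  # placeholder; a real tester would point at a model config class
    audio_config_key = "audio_config"

    def __init__(self, batch_size=2, seq_length=12, num_frames=16,
                 bos_token_id=1, audio_token_id=9, pad_token_id=0):
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.num_frames = num_frames
        self.bos_token_id = bos_token_id
        self.audio_token_id = audio_token_id
        self.pad_token_id = pad_token_id

    def get_audio_config(self):
        raise NotImplementedError("model-specific")

    def get_num_audio_tokens(self, num_frames):
        raise NotImplementedError("model-specific")

    def get_config(self):
        # Shared: nest the audio sub-config under the model-specific key.
        return self.config_class(**{self.audio_config_key: self.get_audio_config()})

    def prepare_inputs(self):
        # Shared: BOS, then audio placeholders, then padding, for every sample.
        n = self.get_num_audio_tokens(self.num_frames)
        if 1 + n > self.seq_length:
            raise ValueError("seq_length too small for the audio placeholders")
        padding = [self.pad_token_id] * (self.seq_length - 1 - n)
        row = [self.bos_token_id] + [self.audio_token_id] * n + padding
        return [list(row) for _ in range(self.batch_size)]


class ToyAudioModelTester(AudioTesterSketch):
    """The per-model part: a handful of lines, everything else is inherited."""

    audio_config_key = "encoder_config"

    def get_audio_config(self):
        return {"hidden_size": 32, "num_layers": 2}

    def get_num_audio_tokens(self, num_frames):
        return num_frames // 4  # toy 4x subsampling, purely illustrative


tester = ToyAudioModelTester()
print(tester.get_config())
print(tester.prepare_inputs()[0])
```

The subclass only states what differs per model (the config key, the audio sub-config, and the subsampling rule), which is the design choice the description argues for.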

@tarekziade tarekziade self-assigned this Apr 13, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade

Collaborator Author

run-slow: audioflamingo3, granite_speech, qwen2_audio

@github-actions

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
quantizations: []

@github-actions

Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      5472faa4  workflow commit (merge commit)
PR       0817bdbd  branch commit (from PR)
main     a5533957  base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@eustlb eustlb left a comment

Contributor

This is cool!! 🔥
Models that should be covered by this PR:

  • audioflamingo3
  • glmasr
  • granite_speech
  • higgs_audio_v2
  • kyutai_speech_to_text
  • qwen2_audio
  • vibevoice_asr
  • voxtral
  • voxtral_realtime
  • musicflamingo

Might also be covered:

  • gemma3n
  • gemma4
  • qwen2_5_omni
  • qwen3_omni_moe

Comment thread tests/alm_tester.py Outdated
Comment thread tests/alm_tester.py Outdated
Comment thread tests/alm_tester.py Outdated
Comment thread tests/alm_tester.py
Comment thread tests/alm_tester.py Outdated
Comment on lines +156 to +160
    def get_num_audio_tokens(self, audio_features):
        """Compute number of audio placeholder tokens from features. Override for different subsampling."""
        # Default: 2-stage pooling (common for Whisper-style encoders)
        input_length = (audio_features.shape[-1] - 1) // 2 + 1
        return (input_length - 2) // 2 + 1
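
As a worked example of the arithmetic in the snippet above, the sketch below applies the same two integer formulas to a plain frame count instead of a tensor shape. The function name `num_audio_tokens` and the sample frame counts are assumptions made for illustration; they do not come from the PR.

```python
def num_audio_tokens(num_frames: int) -> int:
    """Apply the two-stage formula quoted above to a plain frame count."""
    input_length = (num_frames - 1) // 2 + 1  # first stage: ceil(num_frames / 2)
    return (input_length - 2) // 2 + 1        # second stage: floor(input_length / 2)


for frames in (3000, 100, 7):
    print(frames, "->", num_audio_tokens(frames))
# 3000 -> 750
# 100 -> 25
# 7 -> 2
```

Each stage roughly halves the length, so the placeholder count ends up near a quarter of the input frame count.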

Contributor

We shouldn't put Whisper defaults here, but rather force subclasses to write this method.
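
One possible way to act on this suggestion is sketched below, assuming the base tester can be made an abstract class. The class names `AudioTesterBase` and `WhisperStyleTester` are hypothetical, and a `SimpleNamespace` with a `shape` attribute stands in for a feature tensor so that the example runs without torch. The merged code may well use a different mechanism, such as raising `NotImplementedError`.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class AudioTesterBase(ABC):
    """Hypothetical base tester with no subsampling default."""

    @abstractmethod
    def get_num_audio_tokens(self, audio_features):
        """Return the number of audio placeholder tokens for these features."""


class WhisperStyleTester(AudioTesterBase):
    """Hypothetical subclass that opts in to the two-stage formula explicitly."""

    def get_num_audio_tokens(self, audio_features):
        input_length = (audio_features.shape[-1] - 1) // 2 + 1
        return (input_length - 2) // 2 + 1


# Stand-in for a (batch, mel_bins, frames) tensor; the sizes are illustrative.
features = SimpleNamespace(shape=(2, 80, 3000))
print(WhisperStyleTester().get_num_audio_tokens(features))  # 750

try:
    AudioTesterBase()  # abstract: cannot be instantiated without the override
except TypeError as err:
    print("TypeError:", err)
```

With this shape, a tester that forgets to define the method fails at instantiation time rather than silently inheriting a formula that may not match its encoder.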

Comment thread tests/alm_tester.py Outdated
Comment thread tests/alm_tester.py Outdated
Comment thread tests/alm_tester.py Outdated
@eustlb eustlb requested a review from zucchini-nlp April 27, 2026 07:46

@zucchini-nlp zucchini-nlp left a comment

Member

Great work, thanks

Comment thread tests/alm_tester.py
Comment on lines +80 to +91
        input_ids = input_ids.clone()
        input_ids[input_ids == self.audio_token_id] = self.pad_token_id
        for i in range(input_ids.shape[0]):
            n = num_audio_tokens[i].item() if isinstance(num_audio_tokens, torch.Tensor) else num_audio_tokens
            if 1 + int(n) > self.seq_length:
                raise ValueError(
                    f"Cannot place {int(n)} audio tokens after BOS in a sequence of length {self.seq_length}. "
                    "This likely indicates a mismatch between your feature extraction/configuration and your sequence length. "
                    "Please ensure `seq_length` is >= the number of audio embedding positions + 1."
                )
            input_ids[i, 1 : 1 + int(n)] = self.audio_token_id
        return input_ids

Member

I like it, this allows testing a different number of multimodal inputs per sample!
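
The per-sample behaviour praised here can be sketched without torch, using nested lists in place of tensors. This is an approximation of the quoted snippet, not the PR's code: the free function `place_audio_tokens`, its parameters, and the token ids are assumptions for illustration.

```python
def place_audio_tokens(input_ids, num_audio_tokens, audio_token_id, pad_token_id):
    """List-based sketch: put n audio placeholders right after BOS in each row.

    `num_audio_tokens` may be a single int or one int per sample.
    """
    seq_length = len(input_ids[0])
    # Work on a copy and scrub stray audio ids, as the original does with clone().
    out = [[pad_token_id if tok == audio_token_id else tok for tok in row] for row in input_ids]
    for i, row in enumerate(out):
        per_sample = isinstance(num_audio_tokens, (list, tuple))
        n = num_audio_tokens[i] if per_sample else num_audio_tokens
        if 1 + n > seq_length:
            raise ValueError(
                f"Cannot place {n} audio tokens after BOS in a sequence of length {seq_length}."
            )
        row[1 : 1 + n] = [audio_token_id] * n
    return out


ids = [[1, 5, 6, 7, 8, 9], [1, 5, 99, 7, 8, 9]]
print(place_audio_tokens(ids, [2, 4], audio_token_id=99, pad_token_id=0))
# [[1, 99, 99, 7, 8, 9], [1, 99, 99, 99, 99, 9]]
```

Passing a list of counts gives each sample its own number of placeholders, while a single int applies the same count to every row.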

Comment thread tests/alm_tester.py Outdated
        return {self.audio_config_key: self.get_audio_config()}

    def _prepare_modality_inputs(self, input_ids, config):
        # TODO: add a clear diagram that explains input prep ?

Member

TODO for next PR?

Comment thread tests/vlm_tester.py Outdated
@tarekziade

Collaborator Author

run-slow: audioflamingo3, gemma3, glmasr, granite_speech, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

@github-actions

github-actions Bot commented May 4, 2026

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/gemma3", "models/glmasr", "models/granite_speech", "models/llava_next", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
quantizations: []

@github-actions

github-actions Bot commented May 4, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      ea576a21  workflow commit (merge commit)
PR       184227cb  branch commit (from PR)
main     8c004ec6  base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, gemma3, glmasr, granite_speech, granite_speech_plus, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

@eustlb eustlb added this pull request to the merge queue May 11, 2026
Merged via the queue into main with commit 83f33cd May 11, 2026
95 checks passed
@eustlb eustlb deleted the tarekziade-audio-test branch May 11, 2026 12:10
@stevhliu stevhliu mentioned this pull request May 11, 2026
jp1924 pushed a commit to jp1924/transformers that referenced this pull request May 18, 2026
* audio tester

* tweak check repo for audio tester

* audio -> ALM

* ALMTester: no audio/text defaults; better input prep

* update test_sdpa_can_dispatch_composite_models to handle ALMs

* propagate to other model classes

* cleaner

* updates

* audio_mask_key + updates

* typo

* simplify granite speech

* nits

* some more cleaning

* add test_mismatching_num_audio_tokens

* add get_placeholder_mask

* specific to musicflamingo

* granite speech fix

* let's factorise alm/vlm testers

* make fix-repo

* unskip test_sdpa_can_dispatch_on_flash on qwen2_audio

* should not be skipped

* make fix-repo

* test_mismatching_num_audio_tokens should be skipped for voxtral_realtime

* nit

* _special_token_ids as property and skipped in prepare_config_and_inputs_for_common

* MoE params in common class

* add _TEXT_MODEL_TESTER_DEFAULTS to avoid divergence

* nit

* clearer inits

* _prepare_modality_inputs return dict

* format

* split line for readability

* ran python utils/check_modular_conversion.py --fix_and_overwrite

* testing auto cancel

* testing auto cancel - part 2

* remove comment

* update granite speech plus tests

* fix test

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>


4 participants