audio tester class by tarekziade · Pull Request #45391 · huggingface/transformers

tarekziade · 2026-04-13T06:32:49Z

What does this PR do?

Similarly to the VLM tester, this patch introduces an audio tester class, used in:

Qwen2Audio
AudioFlamingo3
GraniteSpeech

Adding a new audio-language model using this will require ~8-20 lines for the tester (vs ~100-160 before). The boilerplate (config introspection, input preparation, SDPA dispatch test, common skips) lives in one place.
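As a rough illustration of where the line savings come from, here is a minimal sketch of the "shared base tester plus thin subclass" pattern. It is not the PR's actual code: `SharedAudioTesterSketch`, `ToyAudioModelTester`, `prepare_inputs`, and all default values are hypothetical, and plain dicts and lists stand in for config objects and tensors. Only the `audio_config_key` / `get_audio_config()` pairing mirrors names that appear in the diff.

```python
# Hypothetical sketch: class names, defaults, and prepare_inputs are illustrative,
# not the PR's actual API. Plain dicts and lists stand in for configs and tensors.
class SharedAudioTesterSketch:
    """Holds the boilerplate once: defaults, config assembly, dummy inputs."""

    audio_config_key = "audio_config"  # subclasses may rename this key

    def __init__(self, batch_size=2, seq_length=8, vocab_size=99):
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.vocab_size = vocab_size

    def get_audio_config(self):
        # Model-specific: each subclass must describe its own audio encoder.
        raise NotImplementedError

    def get_config(self):
        text_config = {"vocab_size": self.vocab_size}
        return {"text_config": text_config, self.audio_config_key: self.get_audio_config()}

    def prepare_inputs(self):
        # Deterministic dummy token ids, one row per sample.
        row = [i % self.vocab_size for i in range(self.seq_length)]
        input_ids = [list(row) for _ in range(self.batch_size)]
        attention_mask = [[1] * self.seq_length for _ in range(self.batch_size)]
        return {"input_ids": input_ids, "attention_mask": attention_mask}


class ToyAudioModelTester(SharedAudioTesterSketch):
    """Everything a new model has to write in this sketch."""

    audio_config_key = "encoder_config"

    def get_audio_config(self):
        return {"num_mel_bins": 16, "hidden_size": 32}


tester = ToyAudioModelTester(batch_size=3)
config = tester.get_config()
inputs = tester.prepare_inputs()
```

In this sketch the subclass only supplies what is genuinely model-specific (the config key and the audio encoder settings); everything else is inherited.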

HuggingFaceDocBuilderDev · 2026-04-13T06:42:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tarekziade · 2026-04-13T07:18:06Z

run-slow: audioflamingo3, granite_speech, qwen2_audio

github-actions · 2026-04-13T07:19:27Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/granite_speech", "models/qwen2_audio"]
quantizations: []

github-actions · 2026-04-13T07:32:46Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	5472faa4	workflow commit (merge commit)
PR	0817bdbd	branch commit (from PR)
main	a5533957	base commit (on `main`)

✅ No failing test specific to this PR 🎉 👏 !

eustlb

This is cool!! 🔥
Models that should be covered by this PR:

audioflamingo3
glmasr
granite_speech
higgs_audio_v2
kyutai_speech_to_text
qwen2_audio
vibevoice_asr
voxtral
voxtral_realtime
musicflamingo

Models that might also be covered:

gemma3n
gemma4
qwen2_5_omni
qwen3_omni_moe

eustlb · 2026-04-13T10:48:34Z

+    def get_num_audio_tokens(self, audio_features):
+        """Compute number of audio placeholder tokens from features. Override for different subsampling."""
+        # Default: 2-stage pooling (common for Whisper-style encoders)
+        input_length = (audio_features.shape[-1] - 1) // 2 + 1
+        return (input_length - 2) // 2 + 1


We shouldn't put Whisper defaults here, but rather force subclasses to implement this method.
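One way to force that, sketched under assumptions: declare the method abstract on the base class so a subclass that forgets it cannot be instantiated. The class names `AudioTesterBaseSketch` and `WhisperStyleTesterSketch` are hypothetical, and the method takes a plain frame count instead of a feature tensor so the example runs without torch. The two-stage formula is the one from the diff above, moved into the subclass.

```python
from abc import ABC, abstractmethod


class AudioTesterBaseSketch(ABC):
    """Hypothetical base: no subsampling default, subclasses must provide one."""

    @abstractmethod
    def get_num_audio_tokens(self, num_frames: int) -> int:
        """Return the number of audio placeholder tokens for `num_frames` input frames."""


class WhisperStyleTesterSketch(AudioTesterBaseSketch):
    """Hypothetical subclass using the 2-stage pooling from the diff above."""

    def get_num_audio_tokens(self, num_frames: int) -> int:
        # Stage 1: stride-2 reduction of the frame axis.
        input_length = (num_frames - 1) // 2 + 1
        # Stage 2: a second stride-2 reduction on the result.
        return (input_length - 2) // 2 + 1


# Worked arithmetic for 3000 frames:
# (3000 - 1) // 2 + 1 = 1500, then (1500 - 2) // 2 + 1 = 750.
num_tokens = WhisperStyleTesterSketch().get_num_audio_tokens(3000)

# The base class itself cannot be instantiated.
try:
    AudioTesterBaseSketch()
    base_instantiable = True
except TypeError:
    base_instantiable = False
```

Raising `NotImplementedError` from a plain base method would also work; the abstract-method variant fails earlier, at instantiation rather than at first call.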

zucchini-nlp

Great work, thanks

zucchini-nlp · 2026-04-27T09:17:19Z

+        input_ids = input_ids.clone()
+        input_ids[input_ids == self.audio_token_id] = self.pad_token_id
+        for i in range(input_ids.shape[0]):
+            n = num_audio_tokens[i].item() if isinstance(num_audio_tokens, torch.Tensor) else num_audio_tokens
+            if 1 + int(n) > self.seq_length:
+                raise ValueError(
+                    f"Cannot place {int(n)} audio tokens after BOS in a sequence of length {self.seq_length}. "
+                    "This likely indicates a mismatch between your feature extraction/configuration and your sequence length. "
+                    "Please ensure `seq_length` is >= the number of audio embedding positions + 1."
+                )
+            input_ids[i, 1 : 1 + int(n)] = self.audio_token_id
+        return input_ids


I like it; it allows testing a different number of multimodal inputs per sample!
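To make the per-sample behaviour concrete, here is a small sketch of the same placement logic on nested Python lists instead of tensors, so it runs without torch. The function name `place_audio_tokens` and the token ids used below (99 for audio, 0 for padding) are assumptions for illustration, not values from the PR.

```python
def place_audio_tokens(input_ids, num_audio_tokens, audio_token_id, pad_token_id):
    """List-based mirror of the diff above: scrub stray audio tokens, then write
    `n` audio placeholders right after the first (BOS) position of each row."""
    seq_length = len(input_ids[0])
    # Copy the rows and replace any pre-existing audio token with padding.
    out = [[pad_token_id if tok == audio_token_id else tok for tok in row] for row in input_ids]
    for i, row in enumerate(out):
        # Accept either one count per sample or a single shared count.
        n = num_audio_tokens[i] if isinstance(num_audio_tokens, (list, tuple)) else num_audio_tokens
        n = int(n)
        if 1 + n > seq_length:
            raise ValueError(
                f"Cannot place {n} audio tokens after BOS in a sequence of length {seq_length}."
            )
        row[1 : 1 + n] = [audio_token_id] * n
    return out


# Two samples with different audio token counts (2 and 3).
ids = [[1, 5, 6, 7, 8, 9], [1, 99, 6, 7, 8, 9]]
placed = place_audio_tokens(ids, [2, 3], audio_token_id=99, pad_token_id=0)
# placed == [[1, 99, 99, 7, 8, 9], [1, 99, 99, 99, 8, 9]]
```

The input rows are left unmodified, matching the `clone()` in the diff, and a count that does not fit after BOS raises `ValueError`.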

zucchini-nlp · 2026-04-27T09:17:56Z

+        return {self.audio_config_key: self.get_audio_config()}
+
+    def _prepare_modality_inputs(self, input_ids, config):
+        # TODO: add a clear diagram that explains input prep ?


TODO for next PR?

tarekziade · 2026-05-04T13:13:32Z

run-slow: audioflamingo3, gemma3, glmasr, granite_speech, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

github-actions · 2026-05-04T13:15:13Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3", "models/gemma3", "models/glmasr", "models/granite_speech", "models/llava_next", "models/musicflamingo", "models/qwen2_5_omni", "models/qwen2_audio", "models/qwen3_omni_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/vibevoice_asr", "models/voxtral", "models/voxtral_realtime"]
quantizations: []

github-actions · 2026-05-04T15:31:25Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	ea576a21	workflow commit (merge commit)
PR	184227cb	branch commit (from PR)
main	8c004ec6	base commit (on `main`)

✅ No failing test specific to this PR 🎉 👏 !

github-actions · 2026-05-11T10:17:04Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, gemma3, glmasr, granite_speech, granite_speech_plus, llava_next, musicflamingo, qwen2_5_omni, qwen2_audio, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, vibevoice_asr, voxtral, voxtral_realtime

* audio tester
* tweak check repo for audio tester
* audio -> ALM
* ALMTester: no audio/text defaults; better input prep
* udpate test_sdpa_can_dispatch_composite_models to hanlde ALMs
* propagate to other model classes
* cleaner
* updates
* audio_mask_key + updates
* typo
* simplify granite speech
* nits
* some more cleaning
* add test_mismatching_num_audio_tokens
* add get_placeholder_mask
* specific to musicflamingo
* granite speech fix
* let's factorise alm/vlm testers
* make fix-repo
* unskip test_sdpa_can_dispatch_on_flash on qwen2_audio
* should not be skipped
* make fix-repo
* test_mismatching_num_audio_tokens should be skipped for voxtral_realtime
* nit
* _special_token_ids as property and skipped in prepare_config_and_inputs_for_common
* MoE params in common class
* add _TEXT_MODEL_TESTER_DEFAULTS to avoid divergence
* nit
* clearer inits
* _prepare_modality_inputs return dict
* format
* split line for readability
* ran python utils/check_modular_conversion.py --fix_and_overwrite
* testing auto cancel
* testing auto cancel - part 2
* remove comment
* udpate granite speech plus tests
* fix test

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

audio tester

3562c7f

tarekziade requested review from eustlb and zucchini-nlp April 13, 2026 06:32

tarekziade self-assigned this Apr 13, 2026

tweak check repo for audio tester

0817bdb

eustlb reviewed Apr 13, 2026

audio -> ALM

356c922

zucchini-nlp reviewed Apr 13, 2026

Comment thread tests/alm_tester.py Outdated

zucchini-nlp reviewed Apr 13, 2026

Comment thread tests/alm_tester.py Outdated

eustlb added 11 commits April 13, 2026 17:38

ALMTester: no audio/text defaults; better input prep

9663a8e

Merge branch 'main' into tarekziade-audio-test

73c4548

udpate test_sdpa_can_dispatch_composite_models to hanlde ALMs

a599b1d

propagate to other model classes

a7d54dc

cleaner

a302c3e

updates

8fcba58

audio_mask_key + updates

66acc9e

typo

63ca77e

simplify granite speech

7588135

nits

41fed1c

some more cleaning

e5971c7

eustlb mentioned this pull request Apr 21, 2026

🚨 [ALM] Add base model without head #45534

Merged

12 tasks

eustlb added 5 commits April 21, 2026 17:57

add test_mismatching_num_audio_tokens

59703dd

add get_placeholder_mask

6a67f32

specific to musicflamingo

b59f958

granite speech fix

bb986b6

let's factorise alm/vlm testers

670c68c

eustlb added 3 commits April 27, 2026 16:28

nit

95b1f20

clearer inits

c2aa666

_prepare_modality_inputs return dict

5e36c9f

eustlb requested a review from zucchini-nlp April 27, 2026 07:46

zucchini-nlp approved these changes Apr 27, 2026

tarekziade added 2 commits May 4, 2026 14:25

Merge branch 'main' into tarekziade-audio-test

ca5ff0b

format

184227c

split line for readability

d77fbb9

ran python utils/check_modular_conversion.py --fix_and_overwrite

902dbba

tarekziade added 3 commits May 5, 2026 12:37

testing auto cancel

dcdead1

testing auto cancel - part 2

628343d

Merge branch 'main' into tarekziade-audio-test

4c35768

tarekziade mentioned this pull request May 7, 2026

audio tester class tarekziade/tarekziade-transformers-reviewer-test#17

Open

eustlb added 4 commits May 11, 2026 18:00

Merge branch 'main' into tarekziade-audio-test

3f5f4d5

remove comment

c1a4772

udpate granite speech plus tests

9322315

fix test

95da798

eustlb added this pull request to the merge queue May 11, 2026

Merged via the queue into main with commit 83f33cd May 11, 2026
95 checks passed

eustlb deleted the tarekziade-audio-test branch May 11, 2026 12:10

stevhliu mentioned this pull request May 11, 2026

[docs] ALMModelTest #45900

Merged

4 participants