huggingface · stevhliu · May 13, 2026 · May 11, 2026 · May 12, 2026 · May 13, 2026
diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md
@@ -36,7 +36,21 @@ RUN_SLOW=1 pytest tests/models/mymodel/ -v
 
 The Hugging Face CI runs model tests without `@slow` on every pull request, and slow tests run on a nightly schedule (see [Pull request checks](./pr_checks) for what the CI validates).
 
-## Write tests for a causal language model
+## Pick a base test class
+
+Three base classes cover the most common model families. Pick the one that matches your model's modality.
+
+| Base class | Use for | Mixins |
+|---|---|---|
+| `CausalLMModelTest` | Causal language models | `ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin`, `TrainingTesterMixin`, `TensorParallelTesterMixin` |
+| `VLMModelTest` | Vision-language models | `ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin` |
+| `ALMModelTest` | Audio-language models | `ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin` |
+
+`VLMModelTest` and `ALMModelTest` share a common `MultiModalModelTest` parent that nests sub-configs into a composite top-level config and places modality placeholder tokens in `input_ids` alongside the raw modality features (audio or vision). `CausalLMModelTest` doesn't use the multimodal parent. It inherits the same three mixins as the multimodal classes (`ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin`) and adds `TrainingTesterMixin` and `TensorParallelTesterMixin` for training and tensor-parallel coverage.
+
+For architectures that don't fit any of the three (encoder-only, encoder-decoder, etc.), build the test infrastructure directly from the [two-class pattern](#modeltester-and-modeltest) and [test mixins](#test-mixins) described below.
+
+## CausalLMModelTest
 
 `CausalLMModelTest` is the recommended base class for testing causal language models. It inherits from five [test mixins](#test-mixins) and auto-generates tests for save/load, generation, pipelines, training, and tensor parallelism.
 
@@ -66,7 +80,7 @@ These two classes give full test coverage for `MyModel` and all its head classes
 
 `CausalLMModelTester` only requires `base_model_class`. The tester strips the `Model` suffix to get a base name (`LlamaModel` becomes `Llama`), then appends suffixes like `Config` or `ForCausalLM` to discover related classes. If a class doesn't exist in the module, the attribute stays `None` and the tester skips the corresponding tests.
 
-### Overriding defaults
+### Overriding defaults in the CausalLMModelTester
 
 If your model doesn't follow standard naming, or you need to customize behavior, override attributes on the tester or test class.
 
@@ -98,7 +112,7 @@ class YoutuModelTester(CausalLMModelTester):
         self.q_lora_rank = q_lora_rank
 ```
 
-## Write tests for a vision-language model
+## VLMModelTest
 
 `VLMModelTest` is the base class for vision-language models. It inherits from three mixins (`ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin`) and sets `_is_composite = True` to handle multiple sub-models.
 
@@ -134,7 +148,7 @@ class MyVLMTest(VLMModelTest, unittest.TestCase):
     model_tester_class = MyVLMTester
 ```
 
-### Overriding defaults
+### Overriding defaults in the VLMModelTester
 
 When the VLM needs custom vision parameters or non-default config values, override `__init__`. Set defaults with `setdefault` before calling `super().__init__(parent, **kwargs)`. The example below shows the first few defaults from [tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py](https://github.com/huggingface/transformers/blob/main/tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py).
 
@@ -162,6 +176,50 @@ VLM tests differ from `CausalLMModelTest` in a few ways.
 - The tester's `__init__` accepts vision parameters (`image_size`, `patch_size`, `num_channels`, `num_image_tokens`) from `**kwargs` and `setdefault()`.
 - `ConfigTester` uses `has_text_modality=False` because the top-level config is a composite config rather than a text model config.
 
+## ALMModelTest
+
+`ALMModelTest` is the base class for audio-language models (ALMs) like Qwen2Audio, AudioFlamingo3, and GraniteSpeech. It mirrors the VLM pattern with the same `MultiModalModelTest` parent and auto-discovery of head classes. The vision-side machinery is swapped for audio features, an audio sub-config, and an audio-token placement strategy.
+
+```py
+class MyALMTester(ALMModelTester):
+    config_class = MyALMConfig
+    text_config_class = MyALMTextConfig
+    audio_config_class = MyALMAudioConfig
+    conditional_generation_class = MyALMForConditionalGeneration
+    audio_mask_key = "feature_attention_mask"
+
+
+class MyALMTest(ALMModelTest, unittest.TestCase):
+    model_tester_class = MyALMTester
+```
+
+### Overriding defaults in the ALMModelTester
+
+The tester's `__init__` sets ALM-specific defaults (`feat_seq_length=128`, `num_mel_bins=80`, `audio_token_id=0`). To change them, override `__init__` and call `kwargs.setdefault(...)` before calling `super().__init__(parent, **kwargs)`, as in the `Qwen2AudioModelTester` example below.
+
+Two class attributes tell the tester how your model names things.
+
+- `audio_mask_key`: the kwarg name your model expects for the audio mask (`"feature_attention_mask"`, `"input_features_mask"`, etc.). Leave it `None` if your model doesn't consume a separate audio mask.
+- `audio_config_key`: the attribute name your top-level config uses to nest the audio sub-config. Defaults to `"audio_config"` but models like GraniteSpeech use `"encoder_config"`.
+
+```py
+class Qwen2AudioModelTester(ALMModelTester):
+    def __init__(self, parent, **kwargs):
+        kwargs.setdefault("feat_seq_length", 60)
+        kwargs.setdefault("max_source_positions", kwargs["feat_seq_length"] // 2)
+        super().__init__(parent, **kwargs)
+```
+
+`ALMModelTester` requires you to override one hook, `get_audio_embeds_mask(audio_mask)`, and exposes a few more optional ones for customization.
+
+- `get_audio_embeds_mask(audio_mask)`: returns the per-batch mask of audio embedding positions after the encoder's downsampling. The tester uses its row-wise sum to decide how many `audio_token_id` placeholders to insert into `input_ids`, so the count must match what your encoder emits.
+- `create_audio_features()`: returns the audio feature tensor. Default shape is `[batch_size, num_mel_bins, feat_seq_length]`. Override when your model, like GraniteSpeech, expects time-first features (`[batch_size, feat_seq_length, num_mel_bins]`).
+- `create_audio_mask()`: returns the audio-level attention mask. The default builds a random contiguous valid region per row in the batch. Override with a deterministic full-length mask if your tests compare two `prepare_config_and_inputs_for_common()` invocations against each other, or if your audio encoder dispatches to a backend that rejects non-null masks.
+- `place_audio_tokens(input_ids, config, num_audio_tokens)`: places audio placeholder tokens contiguously after `BOS`. Override only if your model needs a different layout.
+- `get_audio_feature_key()`: returns the inputs-dict key for audio features (`"input_features"` by default).
+
+In addition to the inherited multimodal tests, `ALMModelTest` adds `test_mismatching_num_audio_tokens`. The test asserts that the model raises a clear `ValueError` when the number of audio features doesn't match the number of audio placeholder tokens in `input_ids`, and verifies that a prompt with multiple audio segments still completes a forward pass.
+
 ## Write tests for other architectures
 
 For encoder-only, encoder-decoder, audio, or other non-standard architectures, build the test infrastructure directly from the two-class pattern and test mixins described below.