Merged
52 commits
4cd44e4
add a base model class
eustlb Apr 20, 2026
79b833b
ensure BC via conversion mapping
eustlb Apr 20, 2026
57436db
auto classes
eustlb Apr 20, 2026
1bd269c
test updates
eustlb Apr 20, 2026
9010366
ensure BC for accessing attributes
eustlb Apr 22, 2026
af429ef
simplify conversion mapping
eustlb Apr 23, 2026
e5eb9a6
Merge branch 'main' into alm-base-model-class
eustlb May 11, 2026
1da57e8
convert modular
eustlb May 11, 2026
483ea27
Merge remote-tracking branch 'origin/main' into alm-base-model-class
eustlb May 11, 2026
7481402
convert modular
eustlb May 11, 2026
cf7c5f1
apply to voxtral
eustlb May 11, 2026
83799ce
convert modular
eustlb May 11, 2026
464fae5
remove test_model_base_model_prefix overwrite
eustlb May 11, 2026
744567c
make
eustlb May 11, 2026
687b693
make
eustlb May 12, 2026
ff8b9f5
XXXModel class in doc
eustlb May 12, 2026
753b755
add GraniteSpeechPlusModel
eustlb May 12, 2026
ecce440
tests now have a base_model_class
eustlb May 12, 2026
1f5e199
fix qwen2audio
eustlb May 12, 2026
6a1166b
class level base mappings
eustlb May 12, 2026
674e64c
Merge branch 'main' into alm-base-model-class
eustlb May 12, 2026
933f8c6
fix test_reverse_loading_mapping
eustlb May 12, 2026
642bdff
Merge branch 'alm-base-model-class' of github.com:huggingface/transfo…
eustlb May 12, 2026
5383313
fix
eustlb May 12, 2026
44ab9cd
fix
eustlb May 12, 2026
9e0c956
Merge branch 'main' into alm-base-model-class
eustlb May 13, 2026
13656cc
nit
eustlb May 13, 2026
f8f99f0
fix
eustlb May 13, 2026
6996410
Merge branch 'alm-base-model-class' of github.com:huggingface/transfo…
eustlb May 13, 2026
5184f47
deprec in 5.7 -> 5.15
eustlb May 13, 2026
b272b07
fix flaky test
eustlb May 13, 2026
71c4cb5
fix flaky test
eustlb May 13, 2026
4414cf4
Merge branch 'main' into alm-base-model-class
eustlb May 14, 2026
59c99d5
fix flaky test
eustlb May 14, 2026
54e11e4
remove forward_base_model_attrs
eustlb May 18, 2026
241f6f8
_supports_attention_backend in PretrainedModel
eustlb May 18, 2026
d75fca8
removed redundant get/set_input_embeddings
eustlb May 18, 2026
9cb259b
remove requires_grad_ handling
eustlb May 18, 2026
c8b7788
make fix-repo
eustlb May 18, 2026
8c51b43
fix voxtral realtime
eustlb May 18, 2026
26329fc
update vibevoice_asr test
eustlb May 18, 2026
5519f41
update test
eustlb May 18, 2026
8046a76
Merge branch 'main' into alm-base-model-class
eustlb May 18, 2026
c4ac740
Update src/transformers/models/audioflamingo3/modular_audioflamingo3.py
eustlb May 20, 2026
daad5ec
Update src/transformers/models/audioflamingo3/modular_audioflamingo3.py
eustlb May 20, 2026
b09349d
Update src/transformers/models/glmasr/modeling_glmasr.py
eustlb May 20, 2026
a59f750
Merge remote-tracking branch 'origin/main' into alm-base-model-class
eustlb May 20, 2026
6f5be63
make fix-repo
eustlb May 20, 2026
159819c
Merge branch 'main' into alm-base-model-class
eustlb May 20, 2026
80306c6
Merge remote-tracking branch 'origin' into alm-base-model-class
eustlb May 20, 2026
157f7a2
Merge branch 'main' into alm-base-model-class
eustlb May 21, 2026
401cf5a
Merge branch 'main' into alm-base-model-class
eustlb May 26, 2026
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/audioflamingo3.md
@@ -403,6 +403,11 @@ are forwarded, so you can tweak padding or tensor formats just like when calling
[[autodoc]] AudioFlamingo3Encoder
- forward

## AudioFlamingo3Model

[[autodoc]] AudioFlamingo3Model
- forward

## AudioFlamingo3ForConditionalGeneration

[[autodoc]] AudioFlamingo3ForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/glmasr.md
@@ -231,6 +231,11 @@ assert decoded_outputs == EXPECTED_OUTPUT
[[autodoc]] GlmAsrEncoder
- forward

## GlmAsrModel

[[autodoc]] GlmAsrModel
- forward

## GlmAsrForConditionalGeneration

[[autodoc]] GlmAsrForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/granite_speech.md
@@ -163,6 +163,11 @@ for i, transcription in enumerate(transcriptions):

[[autodoc]] GraniteSpeechFeatureExtractor

## GraniteSpeechModel

[[autodoc]] GraniteSpeechModel
- forward

## GraniteSpeechForConditionalGeneration

[[autodoc]] GraniteSpeechForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/granite_speech_plus.md
@@ -143,6 +143,11 @@ for k in range(NUM_SEGMENTS):

[[autodoc]] GraniteSpeechPlusEncoderConfig

## GraniteSpeechPlusModel

[[autodoc]] GraniteSpeechPlusModel
- forward

## GraniteSpeechPlusForConditionalGeneration

[[autodoc]] GraniteSpeechPlusForConditionalGeneration
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/hyperclovax.md
@@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
-*This model was released on 2025-07-21 and added to Hugging Face Transformers on 2026-05-06.*
+*This model was released on 2025-07-21 and added to Hugging Face Transformers on 2026-05-08.*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/musicflamingo.md
@@ -287,6 +287,11 @@ loss.backward()

[[autodoc]] MusicFlamingoProcessor

## MusicFlamingoModel

[[autodoc]] MusicFlamingoModel
- forward

## MusicFlamingoForConditionalGeneration

[[autodoc]] MusicFlamingoForConditionalGeneration
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/pe_audio.md
@@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
-*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*
+*This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.*

# PE Audio

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/pe_audio_video.md
@@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
-*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*
+*This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.*

# PE Audio Video

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/pe_video.md
@@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
-*This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.*
+*This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.*

# PE Video

5 changes: 5 additions & 0 deletions docs/source/en/model_doc/qwen2_audio.md
@@ -251,6 +251,11 @@ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_
[[autodoc]] Qwen2AudioEncoder
- forward

## Qwen2AudioModel

[[autodoc]] Qwen2AudioModel
- forward

## Qwen2AudioForConditionalGeneration

[[autodoc]] Qwen2AudioForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/vibevoice_asr.md
@@ -452,6 +452,11 @@ print(transcription)
- apply_transcription_request
- decode

## VibeVoiceAsrModel

[[autodoc]] VibeVoiceAsrModel
- forward

## VibeVoiceAsrForConditionalGeneration

[[autodoc]] VibeVoiceAsrForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/voxtral.md
@@ -352,6 +352,11 @@ This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb)
[[autodoc]] VoxtralEncoder
- forward

## VoxtralModel

[[autodoc]] VoxtralModel
- forward

## VoxtralForConditionalGeneration

[[autodoc]] VoxtralForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/voxtral_realtime.md
@@ -182,6 +182,11 @@ This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb)
[[autodoc]] VoxtralRealtimeEncoder
- forward

## VoxtralRealtimeModel

[[autodoc]] VoxtralRealtimeModel
- forward

## VoxtralRealtimeForConditionalGeneration

[[autodoc]] VoxtralRealtimeForConditionalGeneration
46 changes: 46 additions & 0 deletions src/transformers/conversion_mapping.py

Collaborator

Cool!

@@ -87,6 +87,14 @@
"vipllava": "llava",
"mistral3": "llava",
"pp_chart2table": "llava",
"voxtral": "qwen2_audio",
"voxtral_realtime": "qwen2_audio",
"audioflamingo3": "qwen2_audio",
"glmasr": "qwen2_audio",
"musicflamingo": "qwen2_audio",
"granite_speech_plus": "granite_speech",
"gemma3n_text": "qwen3_5_text",
"qwen3_5_moe_text": "qwen3_5_text",
"llava_next_video": "llava_next",
"llava_onevision": "llava_next",
# class-based mappings
@@ -103,6 +111,12 @@
"LlavaOnevisionModel": "LlavaModel",
"FuyuModel": "LlavaModel",
"MllamaModel": "LlavaModel",
"VoxtralModel": "Qwen2AudioModel",
"VoxtralRealtimeModel": "Qwen2AudioModel",
"AudioFlamingo3Model": "Qwen2AudioModel",
"GlmAsrModel": "Qwen2AudioModel",
"MusicFlamingoModel": "Qwen2AudioModel",
"GraniteSpeechPlusModel": "GraniteSpeechModel",
"MaskFormerDetrDecoder": "DetrModel",
"Qwen2_5_VLForConditionalGeneration": "Qwen2VLForConditionalGeneration",
# ViT-style vision models (old HuggingFace checkpoint format → new modular format)
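The two hunks above add alias entries: a model type (or class name) on the left reuses the conversion rules registered under the name on the right. The lookup that this implies can be sketched as below. This is only an illustration under assumptions: `resolve_mapping_key`, `MODEL_TYPE_ALIASES`, and `CLASS_ALIASES` are hypothetical names, not the library's API, and only the dictionary entries are copied from the diff.

```python
# Hypothetical sketch: follow alias entries until a canonical mapping key is reached.
# Only the dictionary contents are taken from the diff; all names are illustrative.

MODEL_TYPE_ALIASES = {
    "voxtral": "qwen2_audio",
    "voxtral_realtime": "qwen2_audio",
    "audioflamingo3": "qwen2_audio",
    "glmasr": "qwen2_audio",
    "musicflamingo": "qwen2_audio",
    "granite_speech_plus": "granite_speech",
}

CLASS_ALIASES = {
    "VoxtralModel": "Qwen2AudioModel",
    "VoxtralRealtimeModel": "Qwen2AudioModel",
    "AudioFlamingo3Model": "Qwen2AudioModel",
    "GlmAsrModel": "Qwen2AudioModel",
    "MusicFlamingoModel": "Qwen2AudioModel",
    "GraniteSpeechPlusModel": "GraniteSpeechModel",
}


def resolve_mapping_key(name: str, aliases: dict) -> str:
    """Return the canonical key for `name`, following aliases and guarding against cycles."""
    seen = set()
    while name in aliases:
        if name in seen:
            raise ValueError(f"alias cycle detected at {name!r}")
        seen.add(name)
        name = aliases[name]
    return name


canonical_type = resolve_mapping_key("voxtral", MODEL_TYPE_ALIASES)
canonical_class = resolve_mapping_key("GraniteSpeechPlusModel", CLASS_ALIASES)
```

A name that is not an alias resolves to itself, so canonical entries such as `qwen2_audio` need no special handling.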
@@ -420,6 +434,38 @@ def _build_checkpoint_conversion_mapping():
WeightRenaming(source_patterns=r"^vision_tower", target_patterns="model.vision_tower"),
WeightRenaming(source_patterns=r"^multi_modal_projector", target_patterns="model.multi_modal_projector"),
],
"qwen2_audio": [
WeightRenaming(source_patterns=r"^language_model.model", target_patterns="model.language_model"),
WeightRenaming(source_patterns=r"^language_model.lm_head", target_patterns="lm_head"),
WeightRenaming(source_patterns=r"^audio_tower", target_patterns="model.audio_tower"),
WeightRenaming(source_patterns=r"^multi_modal_projector", target_patterns="model.multi_modal_projector"),
],
"Qwen2AudioModel": [
WeightRenaming(source_patterns=r"^language_model.model", target_patterns="language_model"),
],
"granite_speech": [
WeightRenaming(source_patterns=r"^language_model.model", target_patterns="model.language_model"),
WeightRenaming(source_patterns=r"^language_model.lm_head", target_patterns="lm_head"),
WeightRenaming(source_patterns=r"^encoder", target_patterns="model.encoder"),
WeightRenaming(source_patterns=r"^projector", target_patterns="model.projector"),
],
"GraniteSpeechModel": [
WeightRenaming(source_patterns=r"^language_model.model", target_patterns="language_model"),
],
"vibevoice_asr": [
WeightRenaming(source_patterns=r"^language_model.model", target_patterns="model.language_model"),
WeightRenaming(source_patterns=r"^language_model.lm_head", target_patterns="lm_head"),
WeightRenaming(
source_patterns=r"^acoustic_tokenizer_encoder", target_patterns="model.acoustic_tokenizer_encoder"
),
WeightRenaming(
source_patterns=r"^semantic_tokenizer_encoder", target_patterns="model.semantic_tokenizer_encoder"
),
WeightRenaming(source_patterns=r"^multi_modal_projector", target_patterns="model.multi_modal_projector"),
],
"VibeVoiceAsrModel": [
WeightRenaming(source_patterns=r"^language_model.model", target_patterns="language_model"),
],
"llava_next": [
WeightRenaming(source_patterns=r"^language_model.lm_head", target_patterns="lm_head"),
WeightRenaming(source_patterns=r"^language_model", target_patterns="model.language_model"),
Expand Down
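To make the effect of the new `qwen2_audio` rules concrete, the sketch below applies the same source/target patterns to a few legacy-style checkpoint keys with plain `re`. It is not the `WeightRenaming` implementation: `rename_key` is a hypothetical helper, the example keys are invented for illustration, and the first-match-wins ordering is an assumption.

```python
import re

# (source_pattern, target_pattern) pairs copied from the "qwen2_audio" entry in the diff.
QWEN2_AUDIO_RULES = [
    (r"^language_model.model", "model.language_model"),
    (r"^language_model.lm_head", "lm_head"),
    (r"^audio_tower", "model.audio_tower"),
    (r"^multi_modal_projector", "model.multi_modal_projector"),
]


def rename_key(key: str, rules) -> str:
    """Hypothetical helper: apply the first rule whose pattern matches, else keep the key."""
    for pattern, target in rules:
        new_key, n_subs = re.subn(pattern, target, key, count=1)
        if n_subs:
            return new_key
    return key


# Illustrative legacy-layout keys (assumed, not taken from a real checkpoint).
legacy_keys = [
    "language_model.model.layers.0.mlp.up_proj.weight",
    "language_model.lm_head.weight",
    "audio_tower.conv1.weight",
    "multi_modal_projector.linear.weight",
]

renamed = [rename_key(k, QWEN2_AUDIO_RULES) for k in legacy_keys]
for old, new in zip(legacy_keys, renamed):
    print(f"{old} -> {new}")
```

Because every pattern is anchored with `^`, a key that already uses the new layout (for example one starting with `model.audio_tower`) matches no rule and is returned unchanged, so applying the rules twice is harmless in this sketch.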
@@ -100,6 +100,7 @@ class AudioFlamingo3Config(PreTrainedConfig):
audio_token_id: int = 151669
projector_hidden_act: str = "gelu"
projector_bias: bool = True
tie_word_embeddings: bool = True

def __post_init__(self, **kwargs):
if isinstance(self.audio_config, dict):