feat[vLLM x v5]: Expose max_source_positions on VibeVoiceAsrConfig by harshaljanjani · Pull Request #46472 · huggingface/transformers

harshaljanjani · 2026-06-07T11:47:26Z

What does this PR do?

→ Fixes vllm-project/vllm#39330 (comment)
→ Exposes max_source_positions on VibeVoiceAsrConfig so profiling can resolve the model's max audio token budget.
→ Companion to the vLLM Transformers audio backend PR. vLLM's get_max_audio_tokens looks for one of max_source_positions, max_position_embeddings, or max_pos_emb. VibeVoice exposes none of these (its acoustic tokenizer is ConvNet-based, with no positional embeddings), so profiling raises ValueError. With this change the same path resolves cleanly to 450: max_source_positions = ceil(1440000 / 3200) = 450, where 1440000 samples is 60 s at 24 kHz. The value is already in the config and is now exposed through the property.
→ Made sure this doesn't cause any regressions in tests/models/vibevoice_asr/.
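For readers unfamiliar with the lookup described above, here is a minimal sketch. The helper `resolve_max_audio_tokens` and the `SimpleNamespace` configs are hypothetical stand-ins, not vLLM's actual `get_max_audio_tokens`; only the three attribute names and the 450 arithmetic come from this description.

```python
import math
from types import SimpleNamespace

# Attribute names the PR description says vLLM looks for.
CANDIDATE_ATTRS = ("max_source_positions", "max_position_embeddings", "max_pos_emb")


def resolve_max_audio_tokens(config):
    """Hypothetical helper: return the first candidate attribute set on config."""
    for name in CANDIDATE_ATTRS:
        value = getattr(config, name, None)
        if value is not None:
            return value
    raise ValueError(f"config exposes none of {CANDIDATE_ATTRS}")


# 60 s of audio at 24 kHz, with one token per 3200 samples.
chunk_size = 60 * 24_000
hop_length = 3200
budget = math.ceil(chunk_size / hop_length)

config_before = SimpleNamespace()  # exposes no positional attribute
config_after = SimpleNamespace(max_source_positions=budget)
```

Under these assumptions, `resolve_max_audio_tokens(config_before)` raises `ValueError`, while `resolve_max_audio_tokens(config_after)` returns the derived budget.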

cc: @eustlb @vasqu

Code Agent Policy

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you fix any necessary existing tests?

vasqu

Tbh, I'm not sure whether we should instead standardize on one of those attributes across all models (we can keep the alternatives for older models, but future models should use only one standard).

Would first like some opinion from @eustlb @ebezzam on what you think here. This will come up more and more probably

ebezzam · 2026-06-09T10:23:28Z

@harshaljanjani thanks for pointing this out!

As you've suggested, I think we should expose this attribute but I would do it differently to be consistent with existing models:

as an attribute (rather than property)
max_position_embeddings, as it seems to be more frequent than max_source_positions? And moving forward we should probably stick with max_position_embeddings, @eustlb?
like other models, add it to the encoder config of VibeVoice rather than its main modeling config

for reference, I've listed below the attributes that existing models expose:

audioflamingo3 / musicflamingo: max_source_positions
voxtral / voxtral_realtime: max_position_embeddings
glm asr: max_position_embeddings
parakeet: max_position_embeddings
cohere: max_position_embeddings
ibm: max_pos_emb

ebezzam · 2026-06-09T10:50:36Z

        super().__post_init__(**kwargs)

+    @property
+    def max_source_positions(self) -> int:


rather make it an attribute called max_position_embeddings in the encoder config like (most) other models? as mentioned here

Done, thanks! Just a small addition on the __post_init__ override given VibeVoice's value is derived from acoustic_tokenizer_chunk_size (the single source of truth in our case) unlike the other static budgets linked in the aforementioned comment. Thought we'd like to avoid cases where a user doubles chunk_size and the budget is kept silently stale; would love to know if there's a more standardized way to do it, or if you'd like the onus to be on the user to make sure of this when they pass acoustic_tokenizer_chunk_size and max_position_embeddings both :)
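A minimal sketch of the concern raised in the reply above, assuming toy dataclasses in place of the real config classes. `ToyEncoderConfig` and `ToyAsrConfig` are hypothetical; only the field names `acoustic_tokenizer_chunk_size`, `hop_length`, and `max_position_embeddings` and the default values 1440000 and 3200 come from this thread.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyEncoderConfig:
    hop_length: int = 3200
    # A static budget that a user (or a saved config) might supply.
    max_position_embeddings: Optional[int] = None


@dataclass
class ToyAsrConfig:
    acoustic_tokenizer_chunk_size: int = 1_440_000
    encoder: ToyEncoderConfig = field(default_factory=ToyEncoderConfig)

    def __post_init__(self):
        # Refresh the derived budget so it cannot disagree with the chunk size
        # that was passed at construction.
        self.encoder.max_position_embeddings = math.ceil(
            self.acoustic_tokenizer_chunk_size / self.encoder.hop_length
        )


# The user doubles the chunk size but passes the old budget of 450 along with it.
stale_encoder = ToyEncoderConfig(max_position_embeddings=450)
cfg = ToyAsrConfig(acoustic_tokenizer_chunk_size=2_880_000, encoder=stale_encoder)
refreshed_budget = cfg.encoder.max_position_embeddings
```

In this sketch the `__post_init__` refresh overwrites the stale value, so the budget follows the chunk size. The trade-off is that an explicitly passed `max_position_embeddings` is silently ignored.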

thanks for quick change! left some comments but would be nice to get thoughts from others :)

ebezzam · 2026-06-09T15:22:10Z

            self.text_config = CONFIG_MAPPING["qwen2"]()

+        # max_position_embeddings is derived, not a static field; refresh from chunk_size and hop_length.
+        hop_length = int(math.prod(self.acoustic_tokenizer_encoder_config.downsampling_ratios))


can self.acoustic_tokenizer_encoder_config.hop_length be used for this? see here

Yep, swapped in 1cbc688.
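As a rough illustration of why the two expressions are interchangeable, the sketch below assumes a toy encoder config whose `hop_length` is the product of its downsampling ratios. The class `ToyEncoderConfig` is hypothetical, and the ratio values are illustrative only, chosen so that their product is the 3200 quoted in the PR description.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ToyEncoderConfig:
    # Illustrative ratios, not taken from this PR; their product is 3200.
    downsampling_ratios: Tuple[int, ...] = (8, 5, 5, 4, 2, 2)

    @property
    def hop_length(self) -> int:
        # Total number of input samples consumed per output frame.
        return math.prod(self.downsampling_ratios)


encoder_config = ToyEncoderConfig()
hop_from_ratios = int(math.prod(encoder_config.downsampling_ratios))
hop_from_property = encoder_config.hop_length
```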

ebezzam · 2026-06-09T15:36:19Z

+        self.acoustic_tokenizer_encoder_config.max_position_embeddings = math.ceil(
+            self.acoustic_tokenizer_chunk_size / hop_length
+        )


although this does feel a bit weird... I don't think there's another model that has something like acoustic_tokenizer_chunk_size when calling the model forward with something different than the default (here) so we don't really have a precedent for this.

But I feel like, e.g. doubling the "budget" as mentioned here, should rather be done in the forward if a user sets acoustic_tokenizer_chunk_size to something different from the default? Maybe around here

For context (if @vasqu is taking a look), the acoustic_tokenizer_chunk_size parameter was introduced so users can adapt the GPU memory needed by the tokenizer to avoid OOM.

Acknowledged regarding the three options! Fwiw vLLM reads max_position_embeddings during profiling before any forward call; would probably set a precedent for how forward-time overrides are reflected to downstream consumers. Happy to hold the code change here until it's decided which approach is better since this is novel territory. Was also thinking that @hmellor's perspective would be valuable as well :)

Hmm, this is indeed weird. Any reason this was introduced as a parameter and not via the config value directly? I think we opened a can of worms here, where we now have to respect both the config value and the parameter, no?

Also maybe the property solution might be more suited as this is kind of a dynamic value that can change easily if one of the values is set after construction, e.g. config = ...; config.acoustic_tokenizer_chunk_size = 12
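A small sketch of the property approach suggested here, using a hypothetical `ToyAsrConfig` class rather than the real `VibeVoiceAsrConfig`. The budget is recomputed on every access, so it follows changes made after construction.

```python
import math


class ToyAsrConfig:
    """Hypothetical stand-in for a config with a derived token budget."""

    def __init__(self, acoustic_tokenizer_chunk_size=1_440_000, hop_length=3200):
        self.acoustic_tokenizer_chunk_size = acoustic_tokenizer_chunk_size
        self.hop_length = hop_length

    @property
    def max_position_embeddings(self) -> int:
        # Computed on access, so there is no stored value to go stale.
        return math.ceil(self.acoustic_tokenizer_chunk_size / self.hop_length)


config = ToyAsrConfig()
budget_default = config.max_position_embeddings

# Mutating the chunk size after construction is reflected immediately.
config.acoustic_tokenizer_chunk_size = 2_880_000
budget_doubled = config.max_position_embeddings
```

A read-only property like this cannot be set directly, which fits a value that is meant to be derived rather than configured.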

it is both an argument to forward and config value.

so we could do the property approach based on the value of acoustic_tokenizer_chunk_size at construction? Iirc, you should also check the chunk size is a multiple of the hop length like this
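The multiple-of-hop-length check mentioned above could look roughly like the sketch below. The function name `validate_chunk_size` and the error messages are hypothetical and not taken from the linked code.

```python
def validate_chunk_size(chunk_size: int, hop_length: int) -> int:
    """Hypothetical check: return the token budget, or raise on misaligned sizes."""
    if chunk_size <= 0 or hop_length <= 0:
        raise ValueError("chunk_size and hop_length must be positive")
    if chunk_size % hop_length != 0:
        raise ValueError(
            f"chunk_size ({chunk_size}) must be a multiple of hop_length ({hop_length})"
        )
    # Exact division is safe here because the remainder is zero.
    return chunk_size // hop_length


default_budget = validate_chunk_size(1_440_000, 3200)
```

With the default values from this thread the check passes; a chunk size such as 1440001 would raise `ValueError` instead of being rounded silently.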

yeah looking back we shouldn't have allowed both, and just setting at construction would have been enough.

No worries, things like these happen easily! I think we can just raise a simple value error for those cases then

Thank you both for taking the time! Please do let me know if the changes made with b6dc9 so far align with the direction I could gather from this thread; happy to make further changes as well in continuation.

github-actions · 2026-06-11T06:37:21Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: vibevoice_asr

harshaljanjani added 2 commits June 7, 2026 11:26

feat: Expose @property on VibeVoiceAsrConfig

7af2351

fix: Drop audio_config

c922551

harshaljanjani marked this pull request as ready for review June 7, 2026 18:23

github-actions Bot requested review from ArthurZucker and Rocketknight1 June 7, 2026 18:23

vasqu reviewed Jun 8, 2026


ebezzam reviewed Jun 9, 2026


harshaljanjani added 2 commits June 9, 2026 15:23

Merge branch 'main' into feat/vibevoice-asr-max-source-positions

99f36e9

refactor: Address initial review feedback

0d54056

harshaljanjani requested a review from ebezzam June 9, 2026 12:24

ebezzam reviewed Jun 9, 2026


refactor: Use acoustic_tokenizer_encoder_config.hop_length

1cbc688

harshaljanjani requested a review from ebezzam June 9, 2026 18:53

refactor: Refactor per discussion

b6dc9cd

harshaljanjani mentioned this pull request Jun 11, 2026

feat[vLLM × v5]: Add audio support for the Transformers backend vllm-project/vllm#39330

Open

7 tasks


3 participants