feat[vLLM x v5]: Expose max_source_positions on VibeVoiceAsrConfig#46472
feat[vLLM x v5]: Expose max_source_positions on VibeVoiceAsrConfig#46472harshaljanjani wants to merge 6 commits into
Conversation
vasqu
left a comment
There was a problem hiding this comment.
Tbh, I'm not sure whether we should instead standardize into one of those attributes over all models (we can keep alternatives for older models but should focus on future models using only one standard)
Would first like some opinion from @eustlb @ebezzam on what you think here. This will come up more and more probably
|
@harshaljanjani thanks for pointing this out! As you've suggested, I think we should expose this attribute but I would do it differently to be consistent with existing models:
for reference, I've listed below the attributes that existing models expose:
|
| super().__post_init__(**kwargs) | ||
|
|
||
| @property | ||
| def max_source_positions(self) -> int: |
There was a problem hiding this comment.
rather make it an attribute called max_position_embeddings in the encoder config like (most) other models? as mentioned here
There was a problem hiding this comment.
Done, thanks! Just a small addition on the __post_init__ override given VibeVoice's value is derived from acoustic_tokenizer_chunk_size (the single source of truth in our case) unlike the other static budgets linked in the aforementioned comment. Thought we'd like to avoid cases where a user doubles chunk_size and the budget is kept silently stale; would love to know if there's a more standardized way to do it, or if you'd like the onus to be on the user to make sure of this when they pass acoustic_tokenizer_chunk_size and max_position_embeddings both :)
There was a problem hiding this comment.
thanks for quick change! left some comments but would be nice to get thoughts from others :)
| self.text_config = CONFIG_MAPPING["qwen2"]() | ||
|
|
||
| # max_position_embeddings is derived, not a static field; refresh from chunk_size and hop_length. | ||
| hop_length = int(math.prod(self.acoustic_tokenizer_encoder_config.downsampling_ratios)) |
| self.acoustic_tokenizer_encoder_config.max_position_embeddings = math.ceil( | ||
| self.acoustic_tokenizer_chunk_size / hop_length | ||
| ) |
There was a problem hiding this comment.
although this does feel a bit weird... I don't think there's another model that has something like acoustic_tokenizer_chunk_size when calling the model forward with something different than the default (here) so we don't really have a precedent for this.
But I feel like, e.g. doubling the "budget" as mentioned here, should rather be done in the forward if a user sets acoustic_tokenizer_chunk_size to something different from the default? Maybe around here
For context (if @vasqu if taking a look), the acoustic_tokenizer_chunk_size parameter was introduced so users can adapt the GPU memory needed by the tokenizer to avoid OOM.
There was a problem hiding this comment.
Acknowledged regarding the three options! Fwiw vLLM reads max_position_embeddings during profiling before any forward call; would probably set a precedent for how forward-time overrides are reflected to downstream consumers. Happy to hold the code change here until it's decided which approach is better since this is novel territory. Was also thinking that @hmellor's perspective would be valuable as well :)
There was a problem hiding this comment.
Hmm, this is indeed weird. Any reason this was introduced as parameter and not via the config value directly? I think we opened a can of worms here where we now have to respect the config value and parameter no?
Also maybe the property solution might be more suited as this is kind of a dynamic value that can change easily if one of the values is set after construction, e.g. config = ...; config.acoustic_tokenizer_chunk_size = 12
There was a problem hiding this comment.
it is both an argument to forward and config value.
so we could do the property approach based on the value of acoustic_tokenizer_chunk_size at construction? Iirc, you should also check the chunk size is a multiple of the hop length like this
yeah looking back we shouldn't have allowed both, and just setting at construction would have been enough.
There was a problem hiding this comment.
No worries, things like these happen easily! I think we can just raise a simple value error for those cases then
There was a problem hiding this comment.
Thank you both for taking the time! Please do let me know if the changes made with b6dc9 so far align with the direction I could gather from this thread; happy to make further changes as well in continuation.
|
[For maintainers] Suggested jobs to run (before merge) run-slow: vibevoice_asr |
What does this PR do?
→ Fixes vllm-project/vllm#39330 (comment)
→ Exposes
max_source_positionsonVibeVoiceAsrConfigso profiling can resolve the model's max audio token budget.→ Companion to the vLLM Transformers audio backend PR. vLLM's
get_max_audio_tokenslooks for one ofmax_source_positions,max_position_embeddingsormax_pos_emband VibeVoice exposes none of these (its acoustic tokenizer is ConvNet-based with no positional embeddings) so profiling raisesValueError. With this change the same path resolves cleanly to450(max_source_positions = ceil(1440000 (60s @ 24kHz) / 3200) = 450, which is already in the config, is now exposed through the property).→ Made sure this doesn't cause any regressions in
tests/models/vibevoice_asr/.cc: @eustlb @vasqu
Code Agent Policy
Before submitting