Support Granite Speech NAR (NLE) #46031
… the attention pattern to bidirectional. Inherit the conformer encoder from GraniteSpeech
…ab, same blank token)
…granite_speech_nar
eustlb left a comment
Thanks a lot for opening this PR 🤗 first points to iterate on
class GraniteSpeechNarCTCEncoder(GraniteSpeechNarPreTrainedModel):
    """Conformer encoder with BPE CTC head and multi-layer output."""

    config_class = GraniteSpeechNarEncoderConfig

    def __init__(self, config: GraniteSpeechNarEncoderConfig):
I do get that we're not using GraniteSpeechCTCEncoder because we're adding training support here, but this is for training the ctc-based encoder, not this Nar model, right? https://arxiv.org/abs/2603.08397 mentions that the fully joint training is not explored in this approach. Therefore, we might not want to add training for it here.
Likewise, why don't we simply use AutoModel.from_config with granite_speech_encoder? Isn't this just the same as granite speech encoder?
> I do get that we're not using GraniteSpeechCTCEncoder because we're adding training support here, but this is for training the ctc-based encoder, not this Nar model, right?
> Isn't this just the same as granite speech encoder?
It's not exactly the same encoder. Besides training support, we also add a BPE prediction head, with the _posterior_weighted_pool downsampling method applied before the BPE head predictions.
I can try to inherit from GraniteSpeechCTCEncoder and add this functionality on top of it.
> the fully joint training is not explored in this approach
We managed to see clear gains from fully joint training, and it actually improves multi-step inference (applying multiple editing steps). In case someone wants to adapt this model with new data, I would recommend updating both the encoder and the LLM with LoRA adapters (that's the reason to expose this loss).
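To make the pooling step above concrete, here is a minimal pure-Python sketch of what posterior-weighted pooling could look like. It is not the PR's `_posterior_weighted_pool`: it assumes each window of `pool_window` frames is averaged with weights equal to the non-blank posterior (1 - blank probability). The function name, the `eps` parameter and the handling of a short final window are illustrative assumptions.

```python
def posterior_weighted_pool(frames, blank_probs, pool_window, eps=1e-8):
    """Downsample `frames` in time by `pool_window` (hypothetical helper).

    frames:      list of T feature vectors (each a list of D floats)
    blank_probs: list of T CTC blank posteriors in [0, 1]
    Assumption: each frame is weighted by its non-blank posterior,
    so frames the CTC head considers blank contribute little.
    """
    if len(frames) != len(blank_probs):
        raise ValueError("frames and blank_probs must have the same length")
    pooled = []
    for start in range(0, len(frames), pool_window):
        window = frames[start:start + pool_window]
        weights = [1.0 - p for p in blank_probs[start:start + pool_window]]
        total = sum(weights) + eps  # eps avoids division by zero on all-blank windows
        dim = len(window[0])
        pooled.append([
            sum(w * frame[d] for w, frame in zip(weights, window)) / total
            for d in range(dim)
        ])
    return pooled


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [7.0, 9.0]]
blank_probs = [0.0, 1.0, 0.5, 0.5]
pooled = posterior_weighted_pool(frames, blank_probs, pool_window=2)
```

In the first window the second frame is fully blank, so the pooled vector is essentially the first frame; in the second window both frames carry equal weight and are averaged.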
Ok I see, thanks a lot for the clarification. Indeed you can leverage modular to add the BPE head to GraniteSpeechCTCEncoder
Started looking into this. It looks like we'll only save 4-5 lines (in init), since the forward is custom.
I also found a few extra mismatches: control over the self-conditioning layer location, and different dropout in the prediction head.
I'd like to suggest not changing this to modular. WDYT?
and a bidirectional Granite LLM backbone that refines CTC predictions in a single pass.
"""
)
class GraniteSpeechNarForASR(GraniteSpeechNarPreTrainedModel):
Let's just name it ForCTC, since this is what is used in modeling_auto.py
text_config._attn_implementation = config._attn_implementation
self.language_model = GraniteSpeechNarLM._from_config(text_config)
self.language_model should be a base model, and the lm_head should be defined outside of it here, see #45534
NP! I remember it wasn't separate in GraniteSpeech before, but it's good to keep multimodal models consistent.
Done: moved most of the logic to GraniteSpeechNarModel; GraniteSpeechNarForCTC runs the lm_head and computes the loss.
- GraniteSpeechNarForCTC has a base multimodal model + lm_head
- fix test with older encoder bpe head.
- reuse conversion mapping by granite speech
…he encoder/editor. minor docs fix
…d + losses in the ForCTC. Create a new output type for the NarModel class.
@eustlb I finished another pass of updates:
Ready for the next iteration 🤗
- frame stacking as a parameter, pad instead of truncating audio.
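For context on the commit above, the following is a minimal pure-Python sketch of frame stacking with a configurable stack size. It is not the PR's code: it assumes that "pad instead of truncating" means the trailing frames that do not fill a complete stack are zero-padded rather than dropped, and `stack_frames`, `stack_size` and `pad_value` are hypothetical names.

```python
def stack_frames(features, stack_size, pad_value=0.0):
    """Concatenate every `stack_size` consecutive frames into one frame.

    features: list of T frames, each a list of D floats.
    Returns ceil(T / stack_size) frames of size D * stack_size.
    Assumption: an incomplete tail is padded with `pad_value` frames
    instead of being discarded.
    """
    if stack_size < 1:
        raise ValueError("stack_size must be >= 1")
    if not features:
        return []
    dim = len(features[0])
    remainder = len(features) % stack_size
    padded = list(features)
    if remainder:
        padded += [[pad_value] * dim for _ in range(stack_size - remainder)]
    return [
        [value for frame in padded[i:i + stack_size] for value in frame]
        for i in range(0, len(padded), stack_size)
    ]


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = stack_frames(features, stack_size=2)
```

With three input frames and a stack size of 2, truncation would lose the last frame; padding keeps it in a second stacked frame whose second half is filled with `pad_value`.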
eustlb left a comment
thanks for iterating 🤗 ... and let's iterate again
self,
input_features: torch.Tensor,
attention_mask: torch.Tensor | None = None,
output_hidden_states: bool | None = None,
The whole output_hidden_states logic should be handled with the @capture_outputs decorator
Done - hopefully correctly :)
if config.bpe_output_dim is not None:
    self.out_bpe = nn.Linear(config.hidden_dim, config.bpe_output_dim, bias=True)
The released model always has out_bpe, right? Let's not make this optional, and remove the if self.out_bpe is not None logic in the forward.
loss = None
if self.out_bpe is not None and blank_probs is not None:
    pool_window = self.config.bpe_pooling_window
blank_probs should never be None, no? (since self_conditioning_layer is always set)
Let's remove this if entirely; we should always return logits.
"models/gemma4_assistant/test_modeling_gemma4_assistant.py",
"models/granite_speech_nar/test_modeling_granite_speech_nar.py",
]
can you elaborate on this one?
# Copyright 2026 IBM and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
this is not aligned with the way we design tests. Please check test_modeling_parakeet.py for reference and update
tried to make the tests more aligned - and made efforts to pass as many tests as I can (with a NAR model)
…ech_nar.py Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
support positional arguments main_input_name enable encoder gradient checkpointing
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, granite_speech_nar
What does this PR do?
Adds GraniteSpeechNar — a non-autoregressive LLM-based ASR model.
Unlike the autoregressive GraniteSpeech model (which uses GenerationMixin), this model performs non-autoregressive transcript editing: it refines the speech encoder transcript using a bidirectionally-augmented LLM.
GraniteSpeechNar consists of an encoder (conformer), a q-former projector, and a bidirectional LLM (non-causal mask).
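To illustrate the difference between the causal mask of an autoregressive LLM and the non-causal mask mentioned above, here is a small pure-Python sketch. It is not taken from the PR; `build_attention_mask` and its arguments are hypothetical, and the only assumption is the usual convention that padding positions are never attended to.

```python
def build_attention_mask(lengths, max_len, causal):
    """Return a [batch][query][key] boolean mask; True means "may attend".

    lengths: number of valid (non-padding) tokens per sample.
    causal=True  -> token i sees keys j <= i (autoregressive decoding).
    causal=False -> token i sees every valid key (bidirectional editing).
    """
    masks = []
    for length in lengths:
        masks.append([
            [
                q < length and k < length and (not causal or k <= q)
                for k in range(max_len)
            ]
            for q in range(max_len)
        ])
    return masks


causal_mask = build_attention_mask([3], max_len=3, causal=True)[0]
bidirectional_mask = build_attention_mask([3], max_len=3, causal=False)[0]
padded_mask = build_attention_mask([2], max_len=3, causal=False)[0]
```

With the bidirectional mask, the first token can already see the last one, which is what lets the model revise every position of the transcript in a single forward pass instead of generating token by token.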
The model is based on the following paper: NLE: Non-autoregressive LLM-based ASR by Transcript Editing
It's a bit different from most of the models on the hub (due to its non-autoregressive nature), but achieves high transcription throughput and competitive accuracy.
It is ranked 3rd on the Open ASR Leaderboard with faster inference speeds. The speedups are even more significant with a batch size of 1 (e.g. in latency-critical real-time settings).
The model is available with bundled code here:
https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
Key design decisions
- output_encoder_logits=False (default): Free the encoder BPE logits tensor (~T/4 × 100K) to reduce peak memory. Pass True to retain it.
- [1, N_total, D] with per-sample position resets, avoiding padding waste on variable-length audio.
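The packed [1, N_total, D] layout with per-sample position resets can be sketched in pure Python as below. This is an illustration under assumptions, not the PR's implementation: `pack_sequences` is a hypothetical helper, and the returned `cu_seqlens` boundaries are an extra that the description does not mention (they are a common way to keep attention from crossing sample boundaries).

```python
def pack_sequences(sequences):
    """Concatenate variable-length sequences along the time axis.

    sequences: list of samples, each a list of N_i feature vectors (D floats).
    Returns (packed, position_ids, cu_seqlens):
      packed       -> shape [1, N_total, D] as nested lists
      position_ids -> shape [1, N_total], restarting at 0 for every sample
      cu_seqlens   -> cumulative sample boundaries (assumed extra, see lead-in)
    """
    packed, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        packed.extend(seq)
        position_ids.extend(range(len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return [packed], [position_ids], cu_seqlens


batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
    [[0.9, 1.0], [1.1, 1.2]],
]
packed, position_ids, cu_seqlens = pack_sequences(batch)
```

A padded batch of these three samples would hold 3 × 3 = 9 positions, while the packed layout holds only the 6 real ones; the saving grows as the audio lengths in a batch diverge.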
Who can review?
Thanks!
CC: @gsaon