Support Granite Speech NAR (NLE) #46031

Open
avihu111 wants to merge 40 commits into huggingface:main from avihu111:granite_speech_nar

@avihu111 avihu111 commented May 18, 2026

Contributor

What does this PR do?

Adds GraniteSpeechNar — a non-autoregressive LLM-based ASR model.
Unlike the autoregressive GraniteSpeech model (which uses GenerationMixin), this model performs non-autoregressive transcript editing: it refines the speech encoder's transcript using a bidirectionally-augmented LLM.
GraniteSpeechNar consists of an encoder (conformer), a q-former projector, and a bidirectional LLM (non-causal mask).
The model is based on the following paper: NLE: Non-autoregressive LLM-based ASR by Transcript Editing
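
To make the "non-causal mask" concrete, here is a minimal pure-Python sketch contrasting a causal attention mask with a bidirectional one. The helper names `causal_mask` and `bidirectional_mask` are hypothetical and are not taken from the PR, which builds its masks inside the Granite attention layers.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    In a causal (autoregressive) mask, a token only sees itself and the past.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """In a bidirectional (non-causal) mask, every token sees every token."""
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    print("causal:")
    for row in causal_mask(3):
        print(row)
    print("bidirectional:")
    for row in bidirectional_mask(3):
        print(row)
```

With a bidirectional mask, every transcript position can be refined using both left and right context in one forward pass, which is what allows editing without token-by-token generation.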

It's a bit different from most of the models on the Hub (due to its non-autoregressive nature), but it achieves high transcription throughput and competitive accuracy.
It is ranked 3rd on the Open ASR Leaderboard, with faster inference speeds. The speedups are even more significant with a batch size of 1 (e.g. in latency-critical real-time settings).

The model is available with bundled code here:
https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar

Key design decisions

  • Inherit Granite classes, changing the attention pattern to bidirectional.
  • Inherit the base conformer encoder from GraniteSpeech, which is shared.
  • output_encoder_logits=False (default): Free the encoder BPE logits tensor (~T/4 × 100K) to reduce peak memory. Pass True to retain it.
  • Flat sequence batching: All batch items are concatenated into [1, N_total, D] with per-sample position resets, avoiding padding waste on variable-length audio.
  • CTC collapse (deduplication and blank removal) is done in the model; decoding is done in the processor.
  • Supports fine-tuning the model: exposes the editing loss, the encoder loss, and the copying-regularization loss described in the paper above.
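
The flat sequence batching idea from the list above can be sketched in plain Python. This is only an illustration under simplifying assumptions: each "token" is a single float rather than a D-dimensional vector, and `flatten_batch` is a hypothetical helper, not a function from the PR.

```python
def flatten_batch(
    samples: list[list[float]],
) -> tuple[list[float], list[int], list[int]]:
    """Concatenate variable-length samples into one sequence without padding.

    Returns:
        flat: all tokens back to back (length N_total).
        position_ids: positions that restart at 0 for every sample.
        boundaries: cumulative sample offsets, starting at 0.
    """
    flat: list[float] = []
    position_ids: list[int] = []
    boundaries: list[int] = [0]
    for sample in samples:
        flat.extend(sample)
        position_ids.extend(range(len(sample)))  # per-sample position reset
        boundaries.append(boundaries[-1] + len(sample))
    return flat, position_ids, boundaries


if __name__ == "__main__":
    flat, position_ids, boundaries = flatten_batch(
        [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]
    )
    print(position_ids)  # [0, 1, 2, 0, 0, 1]
    print(boundaries)    # [0, 3, 4, 6]
```

In this toy batch, the flat layout holds 6 tokens, whereas padding every sample to the longest length would hold 9. The `boundaries` list is the kind of information an attention implementation would need to keep each sample from attending to its neighbours; how the PR does this is not shown in the excerpt.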

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@avihu111 avihu111 changed the title support granite speech nar model Support Granite Speech NAR (NLE) May 19, 2026
@avihu111 avihu111 marked this pull request as ready for review May 19, 2026 05:47

@eustlb eustlb left a comment

Contributor

Thanks a lot for opening this PR 🤗 first points to iterate on

Comment on lines +309 to +315

class GraniteSpeechNarCTCEncoder(GraniteSpeechNarPreTrainedModel):
    """Conformer encoder with BPE CTC head and multi-layer output."""

    config_class = GraniteSpeechNarEncoderConfig

    def __init__(self, config: GraniteSpeechNarEncoderConfig):

Contributor

I do get that we're not using GraniteSpeechCTCEncoder because we're adding training support here, but this is for training the CTC-based encoder, not this NAR model, right? https://arxiv.org/abs/2603.08397 mentions that fully joint training is not explored in this approach. Therefore, we might not want to add training for it here.

Likewise, why don't we simply use AutoModel.from_config with granite_speech_encoder? Isn't this just the same as granite speech encoder?

@avihu111 avihu111 May 26, 2026

Contributor Author

> I do get that we're not using GraniteSpeechCTCEncoder because we're adding training support here, but this is for training the ctc-based encoder, not this Nar model, right?
> Isn't this just the same as granite speech encoder?

It's not exactly the same encoder. Besides training support, we also add a BPE prediction head, with the _posterior_weighted_pool downsampling method applied before the BPE head's predictions.
I can try to inherit from GraniteSpeechCTCEncoder and add this functionality on top of it.
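
The excerpt does not show how `_posterior_weighted_pool` works, so the following is only a guess at the general idea, inferred from the `blank_probs` and `bpe_pooling_window` names quoted elsewhere in this review: frames are averaged within fixed windows, and each frame is weighted by its non-blank probability. The function name, signature, and weighting rule below are all assumptions.

```python
def posterior_weighted_pool_sketch(
    frames: list[list[float]],
    blank_probs: list[float],
    window: int,
    eps: float = 1e-8,
) -> list[list[float]]:
    """Downsample frames by `window`, weighting each frame by (1 - blank prob).

    ASSUMPTION: this weighting rule is illustrative only and is not
    confirmed by the PR.
    """
    pooled: list[list[float]] = []
    for start in range(0, len(frames), window):
        chunk = frames[start : start + window]
        weights = [1.0 - p for p in blank_probs[start : start + window]]
        total = sum(weights) + eps  # eps avoids division by zero
        dim = len(chunk[0])
        pooled.append(
            [
                sum(w * frame[d] for w, frame in zip(weights, chunk)) / total
                for d in range(dim)
            ]
        )
    return pooled


if __name__ == "__main__":
    frames = [[1.0], [3.0], [5.0], [7.0]]
    blank_probs = [0.0, 1.0, 0.5, 0.5]
    print(posterior_weighted_pool_sketch(frames, blank_probs, window=2))
```

In this toy input, the second frame is almost certainly blank, so the first pooled vector is dominated by the first frame, while the second window averages its two frames equally.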

> the fully joint training is not explored in this approach

We managed to see clear gains from fully joint training, and it actually improves multi-step inference (applying multiple editing steps). In case someone wants to adapt this model with new data, I would recommend updating both the encoder and LLM with LoRA adapters (that's the reason to expose this loss).

Contributor

Ok I see, thanks a lot for the clarification. Indeed you can leverage modular to add the BPE head to GraniteSpeechCTCEncoder

Contributor Author

Started looking into this. It looks like we'll only save 4-5 lines (in init), since the forward is custom.
I also found a few extra mismatches: control over the self-conditioning layer location, and a different dropout in the prediction head.
I'd like to suggest not changing this to modular. WDYT?

and a bidirectional Granite LLM backbone that refines CTC predictions in a single pass.
"""
)
class GraniteSpeechNarForASR(GraniteSpeechNarPreTrainedModel):

Contributor

Let's just name it ForCTC, since this is what is used in modeling_auto.py

Contributor Author

sounds good!

Contributor Author

done :)

Comment on lines +434 to +436
text_config._attn_implementation = config._attn_implementation
self.language_model = GraniteSpeechNarLM._from_config(text_config)

Contributor

self.language_model should be a base model, and the lm_head should be defined outside of it here, see #45534

Contributor Author

NP! I remember it wasn't separate in GraniteSpeech before, but it's good to keep multimodal models consistent.

Contributor Author

done - moved most of the logic to GraniteSpeechNarModel; GraniteSpeechNarForCTC runs the lm_head and computes the loss.

avihu111 added 3 commits May 27, 2026 07:09
- GraniteSpeechNarForCTC has a base multimodal model + lm_head
- fix test with older encoder bpe head.
- reuse conversion mapping by granite speech
…d + losses in the ForCTC.

Create a new output type for the NarModel class.
@avihu111

Contributor Author

@eustlb I finished another pass of updates:

  • renamed transcribe to generate
  • renamed ForASR to ForCTC
  • refactored to use base_model with the lm_head separated
  • shared ctc_loss/collapse helper functions for the encoder/editor
  • multi-editing inference support (exposed via num_editing_steps=1 in generate)

Ready for the next iteration 🤗
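
For readers unfamiliar with the CTC collapse step mentioned above (deduplication followed by blank removal), here is a minimal sketch. The name `ctc_collapse` and the choice of `blank_id=0` are assumptions for illustration; the shared helper in the PR may differ.

```python
from itertools import groupby


def ctc_collapse(token_ids: list[int], blank_id: int = 0) -> list[int]:
    """Collapse a frame-level CTC path into a token sequence.

    Step 1: merge runs of identical consecutive ids into one id.
    Step 2: drop the blank id.
    The order matters: a blank between two equal tokens keeps both of them.
    """
    return [tok for tok, _ in groupby(token_ids) if tok != blank_id]


if __name__ == "__main__":
    # Frames: blank, 5, 5, blank, 5, 7, 7, blank
    print(ctc_collapse([0, 5, 5, 0, 5, 7, 7, 0]))  # [5, 5, 7]
```

Note that the blank between the second and third `5` is what allows the output to contain the token twice.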

- frame stacking as a parameter, pad instead of truncating audio.

@eustlb eustlb left a comment

Contributor

thanks for iterating 🤗 ... and let's iterate again

Comment thread src/transformers/models/granite_speech_nar/modular_granite_speech_nar.py Outdated
Comment thread src/transformers/models/granite_speech_nar/modular_granite_speech_nar.py Outdated
Comment thread src/transformers/models/granite_speech_nar/modular_granite_speech_nar.py Outdated
Comment thread src/transformers/models/granite_speech_nar/modular_granite_speech_nar.py Outdated
Comment thread src/transformers/models/granite_speech_nar/modular_granite_speech_nar.py Outdated
self,
input_features: torch.Tensor,
attention_mask: torch.Tensor | None = None,
output_hidden_states: bool | None = None,

Contributor

The whole output_hidden_states logic should be handled with the @capture_outputs decorator

Contributor Author

Done - hopefully correctly :)

Comment on lines +385 to +386
if config.bpe_output_dim is not None:
    self.out_bpe = nn.Linear(config.hidden_dim, config.bpe_output_dim, bias=True)

Contributor

The released model always has out_bpe, right? Let's not make this optional, and remove the if self.out_bpe is not None logic in the forward.

Contributor Author

good idea, done!

Comment on lines +426 to +428
loss = None
if self.out_bpe is not None and blank_probs is not None:
    pool_window = self.config.bpe_pooling_window

Contributor

blank_probs should never be None, no? (since self_conditioning_layer is always set)
Let's remove this if entirely; we should always return logits.

Contributor Author

done!

Comment thread utils/check_repo.py
Comment on lines 315 to 317
"models/gemma4_assistant/test_modeling_gemma4_assistant.py",
"models/granite_speech_nar/test_modeling_granite_speech_nar.py",
]

Contributor

it shouldn't

Contributor Author

can you elaborate on this one?

Comment on lines +1 to +6
# Copyright 2026 IBM and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#

Contributor

this is not aligned with the way we design tests. Please check test_modeling_parakeet.py for reference and update

Contributor Author

tried to make the tests more aligned - and made efforts to pass as many tests as I can (with a NAR model)

avihu111 and others added 14 commits June 3, 2026 12:18
…ech_nar.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
…ech_nar.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ech_nar.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
…ech_nar.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
…ech_nar.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
support positional arguments
main_input_name
enable encoder gradient checkpointing
@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, granite_speech_nar
