Support Granite Speech NAR (NLE) by avihu111 · Pull Request #46031 · huggingface/transformers

avihu111 · 2026-05-18T12:52:13Z

What does this PR do?

Adds GraniteSpeechNar, a non-autoregressive LLM-based ASR model.
Unlike the autoregressive GraniteSpeech model (which uses GenerationMixin), this model performs non-autoregressive transcript editing: it refines the speech encoder's transcript using a bidirectionally-augmented LLM.
GraniteSpeechNar consists of an encoder (conformer), a q-former projector, and a bidirectional LLM (non-causal mask).
The model is based on the following paper: NLE: Non-autoregressive LLM-based ASR by Transcript Editing

It's a bit different from most of the models on the Hub (due to its non-autoregressive nature), but it achieves high transcription throughput and competitive accuracy.
It is ranked 3rd on the Open ASR Leaderboard with faster inference speeds. The speedups are even more significant with a batch size of 1 (e.g. in latency-critical real-time settings).

The model is available with bundled code here:
https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar

Key design decisions

Inherit Granite classes, changing the attention pattern to bidirectional.
Inherit the base conformer encoder from GraniteSpeech, which is shared.
output_encoder_logits=False (default): Free the encoder BPE logits tensor (~T/4 × 100K) to reduce peak memory. Pass True to retain it.
Flat sequence batching: All batch items are concatenated into [1, N_total, D] with per-sample position resets, avoiding padding waste on variable-length audio.
CTC collapse (deduplication and blank removal) is done in the model; decoding is done in the processor.
Supports finetuning the model: exposes the editing loss, the encoder loss, and the copying-regularization loss described in the paper above.
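
To illustrate the flat sequence batching item above, here is a minimal sketch in plain Python. It packs samples of different lengths into one flat sequence, resets the position ids at each sample boundary, and builds a block-diagonal, non-causal mask so that tokens attend in both directions but only within their own sample. The function name `pack_flat_batch` and its return layout are hypothetical and are not taken from this PR's implementation, which works on tensors of shape [1, N_total, D].

```python
def pack_flat_batch(lengths):
    """Pack samples of the given lengths into one flat sequence.

    Returns (position_ids, sample_ids, mask), where mask[i][j] is True
    when flat token i is allowed to attend to flat token j.
    """
    position_ids = []
    sample_ids = []
    for sample_idx, length in enumerate(lengths):
        # Positions restart at 0 for every sample (per-sample position reset).
        position_ids.extend(range(length))
        sample_ids.extend([sample_idx] * length)

    n_total = len(position_ids)
    # Bidirectional (non-causal) attention restricted to the same sample:
    # a block-diagonal mask with no lower-triangular constraint.
    mask = [
        [sample_ids[i] == sample_ids[j] for j in range(n_total)]
        for i in range(n_total)
    ]
    return position_ids, sample_ids, mask


position_ids, sample_ids, mask = pack_flat_batch([3, 2])
print(position_ids)  # [0, 1, 2, 0, 1]
print(sample_ids)    # [0, 0, 0, 1, 1]
```

No padding tokens are created here: the flat length is simply the sum of the sample lengths, which is the saving the design note refers to for variable-length audio.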

Code Agent Policy

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

audio models: @eustlb @ebezzam @vasqu
Thanks!
CC: @gsaon


eustlb

Thanks a lot for opening this PR 🤗 first points to iterate on

eustlb · 2026-05-26T14:49:43Z

+
+class GraniteSpeechNarCTCEncoder(GraniteSpeechNarPreTrainedModel):
+    """Conformer encoder with BPE CTC head and multi-layer output."""
+
+    config_class = GraniteSpeechNarEncoderConfig
+
+    def __init__(self, config: GraniteSpeechNarEncoderConfig):


I do get that we're not using GraniteSpeechCTCEncoder because we're adding training support here, but this is for training the ctc-based encoder, not this Nar model, right? https://arxiv.org/abs/2603.08397 mentions that the fully joint training is not explored in this approach. Therefore, we might not want to add training for it here.

Likewise, why don't we simply use AutoModel.from_config with granite_speech_encoder? Isn't this just the same as granite speech encoder?

> I do get that we're not using GraniteSpeechCTCEncoder because we're adding training support here, but this is for training the ctc-based encoder, not this Nar model, right?
> Isn't this just the same as granite speech encoder?

It's not exactly the same encoder. Besides training support, we also add a bpe prediction head with the _posterior_weighted_pool downsampling method before the BPE head predictions.
I can try to inherit the GraniteSpeechCTCEncoder and add this functionality on top of it.

> the fully joint training is not explored in this approach

We managed to see clear gains from fully joint training, and it actually improves multi-step inference (applying multiple editing steps). In case someone wants to adapt this model with new data, I would recommend updating both the encoder and LLM with LoRA adapters (that's the reason to expose this loss).

Ok I see, thanks a lot for the clarification. Indeed you can leverage modular to add the BPE head to GraniteSpeechCTCEncoder

Started looking into this. It looks like we'll only save 4-5 lines (in init), since the forward is custom.
I also found a few extra mismatches: control over the self-conditioning layer location, and a different dropout in the prediction head.
I'd like to suggest not changing this to modular. WDYT?

eustlb · 2026-05-26T15:06:18Z

+    and a bidirectional Granite LLM backbone that refines CTC predictions in a single pass.
+    """
+)
+class GraniteSpeechNarForASR(GraniteSpeechNarPreTrainedModel):


Let's just name it ForCTC, since this is what is used in modeling_auto.py

sounds good!

eustlb · 2026-05-26T15:07:27Z

+            text_config._attn_implementation = config._attn_implementation
+        self.language_model = GraniteSpeechNarLM._from_config(text_config)
+


self.language_model should be a base model, and the lm_head should be defined outside of it here, see #45534

NP! I remember it wasn't separate in GraniteSpeech before, but it's good to keep multimodal models consistent.

Done. I moved most of the logic to GraniteSpeechNarModel; GraniteSpeechNarForCTC runs the lm_head and computes the loss.


avihu111 · 2026-05-27T10:25:39Z

@eustlb I finished another pass of updates:

renamed transcribe to generate
renamed ForASR to ForCTC
refactored to use base_model with the lm_head separated
shared ctc_loss/collapse helper functions for the encoder/editor
multi-editing inference support (exposed via num_editing_steps=1 in generate)
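
For readers unfamiliar with the CTC collapse step that the shared helper performs (deduplication and blank removal, as described in the PR summary), here is a minimal sketch in plain Python. The function name `ctc_collapse` and the choice of `blank_id=0` are assumptions made for illustration; the helper in this PR and its blank token id may differ.

```python
def ctc_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame CTC prediction into a token sequence.

    Consecutive repeats are merged first, then blanks are dropped, so a
    blank between two identical tokens keeps both of them.
    blank_id=0 is an assumed value for this sketch.
    """
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only if it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


print(ctc_collapse([0, 5, 5, 0, 5, 7, 7, 0]))  # [5, 5, 7]
```

The resulting ids would then be turned into text by the processor, which is where the PR places decoding.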

Ready for the next iteration 🤗


eustlb

thanks for iterating 🤗 ... and let's iterate again

eustlb · 2026-06-03T08:49:14Z

+        self,
+        input_features: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+        output_hidden_states: bool | None = None,


The whole output_hidden_states logic should be handled with the @capture_outputs decorator

Done - hopefully correctly :)

eustlb · 2026-06-03T08:58:36Z

+        if config.bpe_output_dim is not None:
+            self.out_bpe = nn.Linear(config.hidden_dim, config.bpe_output_dim, bias=True)


the released model always has out_bpe, right? let's not make this optional and remove the if self.out_bpe is not None logic in the forward

good idea, done!

eustlb · 2026-06-03T09:00:31Z

+        loss = None
+        if self.out_bpe is not None and blank_probs is not None:
+            pool_window = self.config.bpe_pooling_window


blank_probs should never be None, no? (since self_conditioning_layer is always set)
let's fully remove this if; we should always return logits

eustlb · 2026-06-03T09:01:04Z

    "models/gemma4_assistant/test_modeling_gemma4_assistant.py",
+    "models/granite_speech_nar/test_modeling_granite_speech_nar.py",
 ]


it shouldn't

can you elaborate on this one?

eustlb · 2026-06-03T09:02:22Z

+# Copyright 2026 IBM and The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#


this is not aligned with the way we design tests. Please check test_modeling_parakeet.py for reference and update

tried to make the tests more aligned - and made efforts to pass as many tests as I can (with a NAR model)


github-actions · 2026-06-03T14:55:29Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, granite_speech_nar

avihu111 added 11 commits May 18, 2026 12:38

support granite speech nar model

7c816f9

attempt to use modular - inherit from the Granite base LLM - changing…

bb5b648

… the attention pattern to bidirectional. Inherit the conformer encoder from GraniteSpeech

minor

a46f94d

minor

925bfdb

minor

daf4f93

add docs, rename components, skip submodule tests

65729c2

fix check_config_docstrings_have_checkpoints

fa81168

add dates

f72b333

save processor imports

151c709

avoid a crash without torch available

d5d851c

fix unguarded torch usage in typing

0c29229

avihu111 changed the title ~~support granite speech nar model~~ Support Granite Speech NAR (NLE) May 19, 2026

minor

e592181

avihu111 marked this pull request as ready for review May 19, 2026 05:47

github-actions Bot requested review from ArthurZucker and Rocketknight1 May 19, 2026 05:47

clean up variable names, avoid spliting the language_model/lm_head fo…

480801c

…rwards

eustlb self-assigned this May 19, 2026

eustlb added New model Audio labels May 19, 2026

avihu111 added 6 commits May 21, 2026 09:53

change encoder bpe prediction head to match the editor. (original voc…

2a95b12

…ab, same blank token)

Merge branch 'main' of https://github.com/avihu111/transformers into …

43e739a

…granite_speech_nar

add integration tests for granite speech nar

16efa34

ruff newline

4707df0

minor fixes after pulling main

18abe6e

minor

8da331d

eustlb reviewed May 26, 2026


avihu111 added 3 commits May 27, 2026 07:09

- renames (GraniteSpeechNarForCTC, generate)

e0a51e5

- GraniteSpeechNarForCTC has a base multimodal model + lm_head - fix test with older encoder bpe head. - reuse conversion mapping by granite speech

shared helper methods for ctc loss and ctc decoding, shared between t…

3fc93c2

…he encoder/editor. minor docs fix

move the encoding code to GraniteSpeechNarModel, keep just the lm_hea…

97e7c1b

…d + losses in the ForCTC. Create a new output type for for the NarModel class.

avihu111 added 3 commits May 27, 2026 08:21

add logits scaling (as used in granite lm_head)

367e0ee

add multi-step editing support. (without recomputing audio_embeds)

3db6757

minor

17bc4b6

- resolve a depracation warning

77eb543

- frame stacking as a parameter, pad instead of truncating audio.

eustlb reviewed Jun 3, 2026


avihu111 and others added 14 commits June 3, 2026 12:18

Update src/transformers/models/granite_speech_nar/modular_granite_spe…

36f2745

…ech_nar.py Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

Update src/transformers/models/granite_speech_nar/modular_granite_spe…

70c06b5

…ech_nar.py Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

simplify: encoder always has bpe head

3ffc779

use @capture_outputs for encoder hidden states

0d6af41

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update src/transformers/models/granite_speech_nar/modular_granite_spe…

67b3303

…ech_nar.py Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

Update src/transformers/models/granite_speech_nar/modular_granite_spe…

7d03457

…ech_nar.py Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

Update src/transformers/models/granite_speech_nar/modular_granite_spe…

fd8fa8a

…ech_nar.py Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

simplify setting is_causal=False

212668f

use more standard tests

c1223d4

minor fixes after refactoring the tests

032cc08

minor

dc7901b

resolve some issues in the tests:

28fed5e

support positional arguments main_input_name enable encoder gradient checkpointing

stack logits to pass test_determinism

77473e5

make labels test consistent

8d089f5

