40 commits
7c816f9
support granite speech nar model
avihu111 May 18, 2026
bb5b648
attempt to use modular - inherit from the Granite base LLM - changing…
avihu111 May 18, 2026
a46f94d
minor
avihu111 May 18, 2026
925bfdb
minor
avihu111 May 18, 2026
daf4f93
minor
avihu111 May 18, 2026
65729c2
add docs, rename components, skip submodule tests
avihu111 May 18, 2026
fa81168
fix check_config_docstrings_have_checkpoints
avihu111 May 18, 2026
f72b333
add dates
avihu111 May 18, 2026
151c709
save processor imports
avihu111 May 18, 2026
d5d851c
avoid a crash without torch available
avihu111 May 19, 2026
0c29229
fix unguarded torch usage in typing
avihu111 May 19, 2026
e592181
minor
avihu111 May 19, 2026
480801c
clean up variable names, avoid spliting the language_model/lm_head fo…
avihu111 May 19, 2026
2a95b12
change encoder bpe prediction head to match the editor. (original voc…
avihu111 May 21, 2026
43e739a
Merge branch 'main' of https://github.com/avihu111/transformers into …
avihu111 May 26, 2026
16efa34
add integration tests for granite speech nar
avihu111 May 26, 2026
4707df0
ruff newline
avihu111 May 26, 2026
18abe6e
minor fixes after pulling main
avihu111 May 26, 2026
8da331d
minor
avihu111 May 26, 2026
e0a51e5
- renames (GraniteSpeechNarForCTC, generate)
avihu111 May 27, 2026
3fc93c2
shared helper methods for ctc loss and ctc decoding, shared between t…
avihu111 May 27, 2026
97e7c1b
move the encoding code to GraniteSpeechNarModel, keep just the lm_hea…
avihu111 May 27, 2026
367e0ee
add logits scaling (as used in granite lm_head)
avihu111 May 27, 2026
3db6757
add multi-step editing support. (without recomputing audio_embeds)
avihu111 May 27, 2026
17bc4b6
minor
avihu111 May 27, 2026
77eb543
- resolve a depracation warning
avihu111 May 27, 2026
36f2745
Update src/transformers/models/granite_speech_nar/modular_granite_spe…
avihu111 Jun 3, 2026
70c06b5
Update src/transformers/models/granite_speech_nar/modular_granite_spe…
avihu111 Jun 3, 2026
3ffc779
simplify: encoder always has bpe head
avihu111 Jun 3, 2026
0d6af41
use @capture_outputs for encoder hidden states
avihu111 Jun 3, 2026
67b3303
Update src/transformers/models/granite_speech_nar/modular_granite_spe…
avihu111 Jun 3, 2026
7d03457
Update src/transformers/models/granite_speech_nar/modular_granite_spe…
avihu111 Jun 3, 2026
fd8fa8a
Update src/transformers/models/granite_speech_nar/modular_granite_spe…
avihu111 Jun 3, 2026
212668f
simplify setting is_causal=False
avihu111 Jun 3, 2026
c1223d4
use more standard tests
avihu111 Jun 3, 2026
032cc08
minor fixes after refactoring the tests
avihu111 Jun 3, 2026
dc7901b
minor
avihu111 Jun 3, 2026
28fed5e
resolve some issues in the tests:
avihu111 Jun 3, 2026
77473e5
stack logits to pass test_determinism
avihu111 Jun 3, 2026
8d089f5
make labels test consistent
avihu111 Jun 3, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1078,6 +1078,8 @@
title: GLM-ASR
- local: model_doc/granite_speech
title: GraniteSpeech
- local: model_doc/granite_speech_nar
title: GraniteSpeechNar
- local: model_doc/granite_speech_plus
title: GraniteSpeechPlus
- local: model_doc/higgs_audio_v2
72 changes: 72 additions & 0 deletions docs/source/en/model_doc/granite_speech_nar.md
@@ -0,0 +1,72 @@
<!--Copyright 2026 IBM and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2026-03-09 and added to Hugging Face Transformers on 2026-06-03.*

# GraniteSpeechNar

## Overview

GraniteSpeechNar is a non-autoregressive (NAR) speech recognition model based on [NLE: Non-autoregressive LLM-based ASR by Transcript Editing](https://huggingface.co/papers/2603.08397). It formulates ASR as conditional transcript editing, achieving fully parallel prediction with significant speedups over autoregressive baselines.

The model consists of:

1. **Conformer Encoder**: A conformer encoder trained with CTC on BPE targets, using block-attention and self-conditioned CTC from the middle layer.

2. **QFormer Projector**: A windowed query-transformer that maps multi-layer encoder features to the LLM embedding space with temporal downsampling.

3. **Bidirectional Granite LLM**: A Granite language model with bidirectional (non-causal) attention that refines CTC predictions in a single forward pass.

The model performs inference in a single pass: the encoder produces initial CTC predictions, which are interleaved with blank insertion slots (exploiting the identity mapping bias of Transformers) and fed alongside projected audio embeddings to the bidirectional LLM for refinement via a latent alignment objective.
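
The input construction described above can be illustrated with a minimal sketch. It is not the model's implementation: the blank id `0`, the helper names, and the slot layout (one blank slot before every token and one at the end) are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = 0  # hypothetical blank token id, chosen for this sketch


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Standard CTC greedy decoding: merge consecutive repeats, then drop blanks."""
    return [tok for tok, _ in groupby(frame_ids) if tok != blank]


def interleave_with_blanks(tokens, blank=BLANK):
    """Assumed slot layout: one blank slot before every token and one at the end."""
    out = [blank]
    for tok in tokens:
        out.extend([tok, blank])
    return out


# Per-frame argmax ids from a CTC head (toy values).
frames = [0, 7, 7, 0, 0, 3, 3, 3, 0, 7]
hypothesis = ctc_greedy_collapse(frames)
editor_input = interleave_with_blanks(hypothesis)
print(hypothesis)
print(editor_input)
```

Under this layout, an editor that copies its input unchanged already reproduces the CTC hypothesis, and each blank slot gives it a position where a missing token could be inserted.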

This model was contributed by [Avihu Dekel](https://huggingface.co/Avihu).

## GraniteSpeechNarConfig

[[autodoc]] GraniteSpeechNarConfig

## GraniteSpeechNarEncoderConfig

[[autodoc]] GraniteSpeechNarEncoderConfig

## GraniteSpeechNarProjectorConfig

[[autodoc]] GraniteSpeechNarProjectorConfig

## GraniteSpeechNarProcessor

[[autodoc]] GraniteSpeechNarProcessor
    - __call__
    - batch_decode

## GraniteSpeechNarFeatureExtractor

[[autodoc]] GraniteSpeechNarFeatureExtractor

## GraniteSpeechNarModel

[[autodoc]] GraniteSpeechNarModel
    - forward

## GraniteSpeechNarLanguageModel

[[autodoc]] GraniteSpeechNarLanguageModel
    - forward

## GraniteSpeechNarForCTC

[[autodoc]] GraniteSpeechNarForCTC
    - forward
    - generate
2 changes: 2 additions & 0 deletions src/transformers/conversion_mapping.py
@@ -92,6 +92,7 @@
"audioflamingo3": "qwen2_audio",
"glmasr": "qwen2_audio",
"musicflamingo": "qwen2_audio",
"granite_speech_nar": "granite_speech",
"granite_speech_plus": "granite_speech",
"gemma3n_text": "qwen3_5_text",
"qwen3_5_moe_text": "qwen3_5_text",
@@ -116,6 +117,7 @@
"AudioFlamingo3Model": "Qwen2AudioModel",
"GlmAsrModel": "Qwen2AudioModel",
"MusicFlamingoModel": "Qwen2AudioModel",
"GraniteSpeechNarModel": "GraniteSpeechModel",
"GraniteSpeechPlusModel": "GraniteSpeechModel",
"MaskFormerDetrDecoder": "DetrModel",
"Qwen2_5_VLForConditionalGeneration": "Qwen2VLForConditionalGeneration",
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -184,6 +184,7 @@
from .granite import *
from .granite4_vision import *
from .granite_speech import *
from .granite_speech_nar import *
from .granite_speech_plus import *
from .granitemoe import *
from .granitemoehybrid import *
5 changes: 5 additions & 0 deletions src/transformers/models/auto/auto_mappings.py
@@ -244,6 +244,9 @@
("granite4_vision_text", "Granite4VisionTextConfig"),
("granite_speech", "GraniteSpeechConfig"),
("granite_speech_encoder", "GraniteSpeechEncoderConfig"),
("granite_speech_nar", "GraniteSpeechNarConfig"),
("granite_speech_nar_encoder", "GraniteSpeechNarEncoderConfig"),
("granite_speech_nar_projector", "GraniteSpeechNarProjectorConfig"),
("granite_speech_plus", "GraniteSpeechPlusConfig"),
("granite_speech_plus_encoder", "GraniteSpeechPlusEncoderConfig"),
("granitemoe", "GraniteMoeConfig"),
@@ -733,6 +736,8 @@
("glmasr_encoder", "glmasr"),
("granite4_vision_text", "granite4_vision"),
("granite_speech_encoder", "granite_speech"),
("granite_speech_nar_encoder", "granite_speech_nar"),
("granite_speech_nar_projector", "granite_speech_nar"),
("granite_speech_plus_encoder", "granite_speech_plus"),
("grounding-dino", "grounding_dino"),
("groupvit_text_model", "groupvit"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -49,6 +49,7 @@
("gemma4", "Gemma4AudioFeatureExtractor"),
("glmasr", "WhisperFeatureExtractor"),
("granite_speech", "GraniteSpeechFeatureExtractor"),
("granite_speech_nar", "GraniteSpeechNarFeatureExtractor"),
("granite_speech_plus", "GraniteSpeechFeatureExtractor"),
("higgs_audio_v2_tokenizer", "DacFeatureExtractor"),
("hubert", "Wav2Vec2FeatureExtractor"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -217,6 +217,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("granite", "GraniteModel"),
("granite4_vision", "Granite4VisionModel"),
("granite_speech", "GraniteSpeechModel"),
("granite_speech_nar", "GraniteSpeechNarForCTC"),
("granite_speech_plus", "GraniteSpeechPlusModel"),
("granitemoe", "GraniteMoeModel"),
("granitemoehybrid", "GraniteMoeHybridModel"),
@@ -1672,6 +1673,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
[
# Model for Connectionist temporal classification (CTC) mapping
("data2vec-audio", "Data2VecAudioForCTC"),
("granite_speech_nar", "GraniteSpeechNarForCTC"),
("hubert", "HubertForCTC"),
("lasr_ctc", "LasrForCTC"),
("parakeet_ctc", "ParakeetForCTC"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -93,6 +93,7 @@
("got_ocr2", "GotOcr2Processor"),
("granite4_vision", "Granite4VisionProcessor"),
("granite_speech", "GraniteSpeechProcessor"),
("granite_speech_nar", "GraniteSpeechNarProcessor"),
("granite_speech_plus", "GraniteSpeechProcessor"),
("grounding-dino", "GroundingDinoProcessor"),
("groupvit", "CLIPProcessor"),
29 changes: 29 additions & 0 deletions src/transformers/models/granite_speech_nar/__init__.py
@@ -0,0 +1,29 @@
# Copyright 2026 IBM and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_granite_speech_nar import *
    from .feature_extraction_granite_speech_nar import *
    from .modeling_granite_speech_nar import *
    from .processing_granite_speech_nar import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
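
The `else` branch above replaces the package entry in `sys.modules` with a lazy module so that submodules are imported only when one of their names is first accessed. The general pattern can be sketched as follows. This is a simplified stand-in, not the `_LazyModule` used by Transformers: the `LazyModule` class, its `attr_to_module` mapping, and the `lazy_demo` module name are all hypothetical.

```python
import importlib
import sys
import types


class LazyModule(types.ModuleType):
    """Hypothetical stand-in: import the backing module on first attribute access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps an exported attribute name to the module that defines it.
        self._attr_to_module = attr_to_module
        self.loaded = []  # records which attributes have been resolved so far

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails.
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        self.loaded.append(attr)
        return value


lazy = LazyModule("lazy_demo", {"sqrt": "math", "dumps": "json"})
sys.modules["lazy_demo"] = lazy  # same registration trick as in the file above

demo = importlib.import_module("lazy_demo")
print(demo.sqrt(16.0))
print(demo.loaded)
```

Only `sqrt` has been resolved at this point; `dumps` stays unresolved until it is first accessed, which is the property that keeps importing a large package cheap.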