[Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context by TheCodeWrangler · Pull Request #35894 · vllm-project/vllm

TheCodeWrangler · 2026-03-03T18:25:40Z

Summary

The Qwen3-ASR realtime WebSocket endpoint (/v1/realtime) currently transcribes each audio segment in complete isolation: there is no cross-segment context, no output post-processing, and the input_stream feedback mechanism is accepted but never consumed. This produces significantly degraded transcription compared to the REST batch endpoint, with repeated content at segment boundaries and the raw "language English<asr_text>" format leaking to clients.

This PR rewrites the realtime streaming logic to match the official Qwen3-ASR SDK streaming approach, where each inference step sends all accumulated audio plus a text prefix of previously-decoded output (with a small rollback to let the model re-decide boundary tokens).

Addresses #35767.

What changed

Model layer (qwen3_asr_realtime.py):

buffer_realtime_audio now accumulates all audio (not fixed 5s segments) and re-infers over the full buffer each step
Wires up the input_stream feedback loop: generated tokens flow back via _collect_generation → _rollback_prefix → next segment's prompt prefix
Prefix rollback drops the trailing N tokens before feeding back, so the model can correct word boundaries and punctuation at segment junctions
_cap_prefix_tokens prevents unbounded prefix growth
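
The rollback and cap steps above can be illustrated with a minimal sketch. The function names and signatures below are hypothetical stand-ins, not the PR's actual _rollback_prefix and _cap_prefix_tokens, and keeping the most recent tokens when capping is an assumption; unfixed_chunks handling is left out.

```python
from typing import List


def rollback_prefix(token_ids: List[int], rollback_tokens: int) -> List[int]:
    """Drop the trailing tokens so the model can re-decide them next step."""
    if rollback_tokens <= 0:
        return list(token_ids)
    return list(token_ids[:-rollback_tokens])


def cap_prefix_tokens(token_ids: List[int], max_prefix_tokens: int) -> List[int]:
    """Bound the prefix length (assumption: the most recent tokens are kept)."""
    if max_prefix_tokens <= 0:
        return []
    return list(token_ids[-max_prefix_tokens:])


def next_prefix(
    generated: List[int],
    rollback_tokens: int = 5,
    max_prefix_tokens: int = 1024,
) -> List[int]:
    """Prefix for the next inference step: roll back first, then cap."""
    return cap_prefix_tokens(
        rollback_prefix(generated, rollback_tokens), max_prefix_tokens
    )
```

With 7 decoded tokens and rollback_tokens=5, only the first 2 tokens are carried into the next prompt; the other 5 are decoded again over the longer audio buffer.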

Connection layer (connection.py):

Implements holdback: only the "stable" portion of each segment (minus trailing rollback tokens) is sent as a transcription.delta — the uncertain tail is held back until the next segment confirms it or stream end flushes it
Detects segment boundaries via prompt_token_ids length changes and EOS tokens
Passes full prefix strings (not just lengths) through a shared deque so delta deduplication tokenizes identically to the model's rollback
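
As a rough illustration of the holdback idea, the sketch below emits only tokens that sit before the trailing rollback window and flushes the remainder at stream end. HoldbackEmitter is a hypothetical helper, not code from connection.py, and it assumes each step yields the full transcript so far and that already-emitted tokens are never revised.

```python
from typing import List


class HoldbackEmitter:
    """Emit only the stable part of a growing transcript (hypothetical helper)."""

    def __init__(self, rollback_tokens: int = 5) -> None:
        self.rollback_tokens = rollback_tokens
        self.emitted = 0               # number of tokens already sent
        self.latest: List[str] = []    # most recent full transcript

    def on_segment(self, transcript: List[str]) -> List[str]:
        """Return the newly stable tokens; hold back the uncertain tail."""
        self.latest = list(transcript)
        stable_end = max(len(transcript) - self.rollback_tokens, self.emitted)
        delta = self.latest[self.emitted:stable_end]
        self.emitted = stable_end
        return delta

    def flush(self) -> List[str]:
        """At stream end, send whatever is still held back."""
        delta = self.latest[self.emitted:]
        self.emitted = len(self.latest)
        return delta


emitter = HoldbackEmitter(rollback_tokens=2)
print(emitter.on_segment(["the", "quick", "brown", "fax"]))
print(emitter.on_segment(["the", "quick", "brown", "fox", "jumps", "over"]))
print(emitter.flush())
```

In this run the misrecognized "fax" never reaches the client: it sits in the held-back tail of the first step and is replaced by "fox" before it becomes stable.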

Session configuration (protocol.py, serving.py):

Adds RealtimeSessionConfig dataclass with per-session tunable parameters
Clients can set segment_duration_s, rollback_tokens, unfixed_chunks, max_prefix_tokens, max_audio_s, and realtime_max_tokens via session.update

Engine fix (async_llm.py):

AsyncLLM.generate() no longer terminates the overall stream when an individual segment finishes with finished=True — only the STREAM_FINISHED sentinel ends the generation loop
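
The control-flow change can be pictured with a small asyncio sketch. This is not the AsyncLLM code; SegmentOutput, the queue, and the sentinel object here are illustrative assumptions that only mirror the described behavior, where a finished segment is yielded and the loop ends solely on the sentinel.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List

STREAM_FINISHED = object()  # stand-in for the sentinel named in the PR


@dataclass
class SegmentOutput:
    text: str
    finished: bool


async def generate(queue: asyncio.Queue) -> AsyncIterator[SegmentOutput]:
    """Yield every output; only the sentinel terminates the stream."""
    while True:
        item = await queue.get()
        if item is STREAM_FINISHED:
            return
        # A segment with finished=True is passed on, but the loop keeps going.
        yield item


async def demo() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    for item in (
        SegmentOutput("hello", finished=True),
        SegmentOutput("hello world", finished=True),
        STREAM_FINISHED,
    ):
        queue.put_nowait(item)
    return [out.text async for out in generate(queue)]


print(asyncio.run(demo()))  # ['hello', 'hello world']
```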

Bug fix (qwen3_asr.py):

Guards mm_options.get("audio") against None to prevent AttributeError when mm_options is not provided
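
A guard of this kind might look like the following sketch; get_audio_options is a hypothetical helper used only to show the pattern, not the actual change in qwen3_asr.py.

```python
from typing import Any, Mapping, Optional


def get_audio_options(mm_options: Optional[Mapping[str, Any]]) -> Optional[Any]:
    """Return the audio options, tolerating a missing mm_options mapping."""
    if mm_options is None:
        # Calling .get() on None would raise AttributeError.
        return None
    return mm_options.get("audio")
```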

Streaming architecture

Client (WebSocket)          Connection              Model (buffer_realtime_audio)
─────────────────          ──────────              ────────────────────────────
audio chunks ──────────►  audio_queue  ──────────►  Qwen3ASRRealtimeBuffer
                                                         │
                           ◄─── StreamingInput ◄────── yield TokensPrompt(all_audio + prefix)
                                                         │
engine.generate() ◄────────────────────────────────────  │
     │                                                   │
     ├── output tokens ──► input_stream.put(tok_ids) ──► _collect_generation()
     │                         │                              │
     │                    holdback_rollback()            _rollback_prefix()
     │                         │                              │
     │                    delta to client              prefix for next segment
     │                                                        │
     └── segment finished ─► input_stream.put([]) ──────────► │ (sentinel → next iteration)
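
The left-hand side of the diagram (audio accumulation and the re-inference trigger) can be sketched as below. AudioAccumulator is a hypothetical class, not the PR's Qwen3ASRRealtimeBuffer, and the 16 kHz sample rate is an assumption; it returns the whole buffer once segment_duration_s of new audio has arrived and trims the oldest samples beyond max_audio_s.

```python
from typing import List, Optional

SAMPLE_RATE = 16_000  # assumed sample rate, not taken from the PR


class AudioAccumulator:
    """Accumulate all audio and signal when re-inference should run."""

    def __init__(
        self,
        segment_duration_s: float = 2.0,
        max_audio_s: float = 300.0,
        sample_rate: int = SAMPLE_RATE,
    ) -> None:
        self.samples: List[float] = []
        self.trigger = int(segment_duration_s * sample_rate)
        self.max_samples = int(max_audio_s * sample_rate)  # assumed positive
        self.new_samples = 0

    def add(self, chunk: List[float]) -> Optional[List[float]]:
        """Append a chunk; return the full buffer when a step is due."""
        self.samples.extend(chunk)
        self.new_samples += len(chunk)
        # Trim the oldest samples once the buffer exceeds max_audio_s.
        if len(self.samples) > self.max_samples:
            self.samples = self.samples[-self.max_samples:]
        if self.new_samples < self.trigger:
            return None
        self.new_samples = 0
        return list(self.samples)
```

Each non-None return value corresponds to one "all_audio" payload in the diagram; the text prefix is attached separately.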

Session configuration parameters

All parameters are optional and have sensible defaults. They can be set per-session via session.update:

segment_duration_s (default: 2.0) — seconds of new audio to accumulate before triggering re-inference
rollback_tokens (default: 5) — tokens dropped from the end of the previous transcription before it becomes the next segment's prefix; higher values give the model more room to revise boundaries but increase latency
unfixed_chunks (default: 2) — initial segments that skip prefix rollback entirely (the model has too little context to benefit from it)
max_prefix_tokens (default: 1024) — hard cap on prefix length in tokens to bound prompt size
max_audio_s (default: 300) — max accumulated audio duration before trimming oldest samples
realtime_max_tokens (default: 128) — max generation tokens per segment
language — ISO-639-1 code (e.g. "en") for the language {Lang}<asr_text> prompt prefix
prompt — free-text context hint injected as a system message (e.g. "user talking about their org/organization")
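
For orientation, the parameters above could be modeled roughly as follows. The field layout, the None defaults for language and prompt, and the apply_session_update helper are assumptions for illustration; the real RealtimeSessionConfig in protocol.py and the session.update event schema may differ.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RealtimeSessionConfig:
    segment_duration_s: float = 2.0
    rollback_tokens: int = 5
    unfixed_chunks: int = 2
    max_prefix_tokens: int = 1024
    max_audio_s: float = 300.0
    realtime_max_tokens: int = 128
    language: Optional[str] = None  # assumed default
    prompt: Optional[str] = None    # assumed default


def apply_session_update(
    config: RealtimeSessionConfig, update: Dict[str, Any]
) -> RealtimeSessionConfig:
    """Return a new config with the given fields overridden (hypothetical)."""
    known = {f.name for f in fields(config)}
    unknown = set(update) - known
    if unknown:
        raise ValueError(f"unknown session parameters: {sorted(unknown)}")
    return replace(config, **update)


cfg = apply_session_update(
    RealtimeSessionConfig(), {"rollback_tokens": 8, "language": "en"}
)
print(cfg.rollback_tokens, cfg.language, cfg.segment_duration_s)  # 8 en 2.0
```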

Test plan

Manual WebSocket streaming test with 30s audio clip at various speeds (1x, 2x, 5x, 15x realtime)
Verify deltas concatenate cleanly without character-level artifacts
Verify session.update parameters propagate through and affect streaming behavior
Verify REST /v1/audio/transcriptions endpoint is unaffected
Verify the mm_options null-check fix doesn't regress dummy input generation

Made with Cursor

gemini-code-assist

Code Review

This pull request introduces a significant improvement to the Qwen3-ASR realtime streaming functionality by adopting the SDK-style cross-segment context approach, which should substantially enhance transcription quality. However, a high-severity prompt injection vulnerability was identified in the prompt construction logic, as untrusted user input from the session configuration is directly embedded into the model's prompt without sanitization, allowing for potential manipulation of the model's behavior via ChatML delimiters. Furthermore, the review also identified high-severity issues related to error handling that could mask bugs and an inconsistent default value for segment duration. Addressing these points will be crucial for the robustness, correctness, and security of this new implementation.

…ntext

The Qwen3-ASR realtime endpoint previously transcribed each audio segment in isolation, producing degraded output with repeated content, raw format leaks, and no cross-segment context. This rewrites the realtime streaming logic to match the official Qwen3-ASR SDK approach:

- Wire up input_stream feedback loop so generated tokens flow back into the next segment's prompt as a text prefix
- Implement prefix rollback (configurable rollback_tokens) so the model can re-decide word boundaries at segment junctions
- Apply holdback on the connection side to only send stable text as deltas, flushing the remainder at stream end
- Add RealtimeSessionConfig with per-session tuning parameters (segment_duration_s, rollback_tokens, unfixed_chunks, max_prefix_tokens, max_audio_s, realtime_max_tokens) configurable via session.update
- Fix AsyncLLM.generate() to not terminate on individual segment finished=True when processing streaming input
- Fix mm_options null check in Qwen3ASRDummyInputsBuilder

Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>

oktrained · 2026-03-10T10:57:57Z

Hello, when will I be able to test the realtime transcription? The state-maintained approach looks like a good one, as it matches the implementation used in the Qwen realtime demo. I also observed the same thing as you: the current stateless realtime endpoint does not produce correct output.

TheCodeWrangler · 2026-03-12T15:20:40Z

Hello, when will I be able to test the realtime transcription? The state-maintained approach looks like a good one, as it matches the implementation used in the Qwen realtime demo. I also observed the same thing as you: the current stateless realtime endpoint does not produce correct output.

The implementation from Qwen itself simply grows the audio window sent to ASR monotonically; at time window T+1 it uses the output of window T as a prefix, with the last 5 tokens stripped.

This will not work for realtime use cases where the audio can be arbitrarily long. For that reason we cannot just package the implementation from the Qwen3-ASR demo.

A custom wrapping implementation would be needed, but I think the vllm community needs broader input into how that should be managed within vllm.
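
To put a rough number on the concern above, the sketch below adds up how much audio is fed to the model when every step re-infers over the whole buffer. It assumes one step per 2 s of new audio and no trimming; the function name is hypothetical.

```python
def total_audio_processed_s(stream_s: float, segment_duration_s: float = 2.0) -> float:
    """Seconds of audio fed to the model when each step replays the full buffer."""
    steps = int(stream_s // segment_duration_s)
    # Step k sees k * segment_duration_s seconds of accumulated audio.
    return sum(k * segment_duration_s for k in range(1, steps + 1))


print(total_audio_processed_s(60))   # 930.0
print(total_audio_processed_s(600))  # 90300.0
```

Under these assumptions a stream 10 times longer costs roughly 100 times as much audio processing, which is why an unbounded growing window does not scale to arbitrarily long sessions.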

lifeiteng · 2026-03-17T10:12:23Z

@TheCodeWrangler

Could you provide a quick update on the current status? Specifically, are there any remaining blockers or specific areas where the community could help with testing?

Looking forward to seeing this merged!

Resolve conflicts after main moved realtime websocket code from vllm/entrypoints/openai/realtime/* to vllm/entrypoints/speech_to_text/realtime/* and changed two APIs:

- vllm.inputs.data.PromptType -> vllm.inputs.PromptType (already re-exported); adopted the shorter import.
- _check_model() now returns an ErrorResponse whose message lives at .error.message; adopted main's safer handling that also rejects missing-model session.update events explicitly.

Kept this PR's tok_params plumbing through render_cmpl_async so the realtime websocket respects max_total_tokens / max_output_tokens.

Also widen SupportsRealtime.buffer_realtime_audio with **kwargs: Any so subclasses (Qwen3-ASR realtime, etc.) can extend the streaming control surface with model-specific options like prefix_texts, rollback_tokens, segment_duration_s, etc. without breaking the Protocol. The dispatcher in realtime/serving.py already forwards these as kwargs.

Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

TheCodeWrangler · 2026-05-19T13:48:36Z

Closing this draft to focus attention on the smaller-scoped Qwen3-ASR work in #35415 and #36018. The cross-segment streaming approach here still seems right to me and the rebased code lives on TheCodeWrangler/vllm:fix/qwen3-asr-realtime-improvements if anyone wants to pick it up. Thanks @lifeiteng @oktrained for the interest along the way.

mergify Bot added the frontend, qwen, and v1 labels Mar 3, 2026

TheCodeWrangler force-pushed the fix/qwen3-asr-realtime-improvements branch from c5b2072 to f9ccdb2 March 3, 2026 18:26

gemini-code-assist Bot reviewed Mar 3, 2026


Comment thread vllm/model_executor/models/qwen3_asr_realtime.py

Comment thread vllm/entrypoints/openai/realtime/connection.py Outdated

Comment thread vllm/entrypoints/openai/realtime/protocol.py Outdated

TheCodeWrangler force-pushed the fix/qwen3-asr-realtime-improvements branch from f9ccdb2 to 917b866 March 3, 2026 18:34

TheCodeWrangler force-pushed the fix/qwen3-asr-realtime-improvements branch from 917b866 to af37f91 March 3, 2026 19:40

This was referenced Mar 3, 2026

[Enhancement]: Qwen3-ASR realtime endpoint produces degraded output — stateless segments, no cross-segment context, raw format leaks #35767

Open

[RFC]: Model-specific realtime streaming abstraction #35908

Open

BWAAEEEK mentioned this pull request May 13, 2026

[Bugfix] Fix Qwen3-ASR transcription streaming postprocessing #42478

Open

TheCodeWrangler closed this May 19, 2026

TheCodeWrangler wants to merge 2 commits into vllm-project:main from TheCodeWrangler:fix/qwen3-asr-realtime-improvements


3 participants
