Skip to content

[Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context#35894

Closed
TheCodeWrangler wants to merge 2 commits into
vllm-project:mainfrom
TheCodeWrangler:fix/qwen3-asr-realtime-improvements
Closed

[Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context#35894
TheCodeWrangler wants to merge 2 commits into
vllm-project:mainfrom
TheCodeWrangler:fix/qwen3-asr-realtime-improvements

Conversation

@TheCodeWrangler

@TheCodeWrangler TheCodeWrangler commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Summary

The Qwen3-ASR realtime WebSocket endpoint (/v1/realtime) currently transcribes each audio segment in complete isolation — no cross-segment context, no output post-processing, and the input_stream feedback mechanism accepted but never consumed. This produces significantly degraded transcription compared to the REST batch endpoint, with repeated content at segment boundaries and raw language English<asr_text> format leaking to clients.

This PR rewrites the realtime streaming logic to match the official Qwen3-ASR SDK streaming approach, where each inference step sends all accumulated audio plus a text prefix of previously-decoded output (with a small rollback to let the model re-decide boundary tokens).

Addresses #35767.

What changed

Model layer (qwen3_asr_realtime.py):

  • buffer_realtime_audio now accumulates all audio (not fixed 5s segments) and re-infers over the full buffer each step
  • Wires up the input_stream feedback loop: generated tokens flow back via _collect_generation_rollback_prefix → next segment's prompt prefix
  • Prefix rollback drops the trailing N tokens before feeding back, so the model can correct word boundaries and punctuation at segment junctions
  • _cap_prefix_tokens prevents unbounded prefix growth

Connection layer (connection.py):

  • Implements holdback: only the "stable" portion of each segment (minus trailing rollback tokens) is sent as a transcription.delta — the uncertain tail is held back until the next segment confirms it or stream end flushes it
  • Detects segment boundaries via prompt_token_ids length changes and EOS tokens
  • Passes full prefix strings (not just lengths) through a shared deque so delta deduplication tokenizes identically to the model's rollback

Session configuration (protocol.py, serving.py):

  • Adds RealtimeSessionConfig dataclass with per-session tunable parameters
  • Clients can set segment_duration_s, rollback_tokens, unfixed_chunks, max_prefix_tokens, max_audio_s, and realtime_max_tokens via session.update

Engine fix (async_llm.py):

  • AsyncLLM.generate() no longer terminates the overall stream when an individual segment finishes with finished=True — only the STREAM_FINISHED sentinel ends the generation loop

Bug fix (qwen3_asr.py):

  • Guards mm_options.get("audio") against None to prevent AttributeError when mm_options is not provided

Streaming architecture

Client (WebSocket)          Connection              Model (buffer_realtime_audio)
─────────────────          ──────────              ────────────────────────────
audio chunks ──────────►  audio_queue  ──────────►  Qwen3ASRRealtimeBuffer
                                                         │
                           ◄─── StreamingInput ◄────── yield TokensPrompt(all_audio + prefix)
                                                         │
engine.generate() ◄────────────────────────────────────  │
     │                                                   │
     ├── output tokens ──► input_stream.put(tok_ids) ──► _collect_generation()
     │                         │                              │
     │                    holdback_rollback()            _rollback_prefix()
     │                         │                              │
     │                    delta to client              prefix for next segment
     │                                                        │
     └── segment finished ─► input_stream.put([]) ──────────► │ (sentinel → next iteration)

Session configuration parameters

All parameters are optional and have sensible defaults. They can be set per-session via session.update:

  • segment_duration_s (default: 2.0) — seconds of new audio to accumulate before triggering re-inference
  • rollback_tokens (default: 5) — tokens dropped from the end of the previous transcription before it becomes the next segment's prefix; higher values give the model more room to revise boundaries but increase latency
  • unfixed_chunks (default: 2) — initial segments that skip prefix rollback entirely (the model has too little context to benefit from it)
  • max_prefix_tokens (default: 1024) — hard cap on prefix length in tokens to bound prompt size
  • max_audio_s (default: 300) — max accumulated audio duration before trimming oldest samples
  • realtime_max_tokens (default: 128) — max generation tokens per segment
  • language — ISO-639-1 code (e.g. "en") for the language {Lang}<asr_text> prompt prefix
  • prompt — free-text context hint injected as a system message (e.g. "user talking about their org/organization")

Test plan

  • Manual WebSocket streaming test with 30s audio clip at various speeds (1x, 2x, 5x, 15x realtime)
  • Verify deltas concatenate cleanly without character-level artifacts
  • Verify session.update parameters propagate through and affect streaming behavior
  • Verify REST /v1/audio/transcriptions endpoint is unaffected
  • Verify the mm_options null-check fix doesn't regress dummy input generation

Made with Cursor

@mergify mergify Bot added frontend qwen Related to Qwen models v1 labels Mar 3, 2026
@TheCodeWrangler TheCodeWrangler force-pushed the fix/qwen3-asr-realtime-improvements branch from c5b2072 to f9ccdb2 Compare March 3, 2026 18:26

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant improvement to the Qwen3-ASR realtime streaming functionality by adopting the SDK-style cross-segment context approach, which should substantially enhance transcription quality. However, a high-severity prompt injection vulnerability was identified in the prompt construction logic, as untrusted user input from the session configuration is directly embedded into the model's prompt without sanitization, allowing for potential manipulation of the model's behavior via ChatML delimiters. Furthermore, the review also identified high-severity issues related to error handling that could mask bugs and an inconsistent default value for segment duration. Addressing these points will be crucial for the robustness, correctness, and security of this new implementation.

Comment thread vllm/model_executor/models/qwen3_asr_realtime.py
Comment thread vllm/entrypoints/openai/realtime/connection.py Outdated
Comment thread vllm/entrypoints/openai/realtime/protocol.py Outdated
@TheCodeWrangler TheCodeWrangler force-pushed the fix/qwen3-asr-realtime-improvements branch from f9ccdb2 to 917b866 Compare March 3, 2026 18:34
…ntext

The Qwen3-ASR realtime endpoint previously transcribed each audio
segment in isolation, producing degraded output with repeated content,
raw format leaks, and no cross-segment context.

This rewrites the realtime streaming logic to match the official
Qwen3-ASR SDK approach:

- Wire up input_stream feedback loop so generated tokens flow back
  into the next segment's prompt as a text prefix
- Implement prefix rollback (configurable rollback_tokens) so the
  model can re-decide word boundaries at segment junctions
- Apply holdback on the connection side to only send stable text as
  deltas, flushing the remainder at stream end
- Add RealtimeSessionConfig with per-session tuning parameters
  (segment_duration_s, rollback_tokens, unfixed_chunks,
  max_prefix_tokens, max_audio_s, realtime_max_tokens) configurable
  via session.update
- Fix AsyncLLM.generate() to not terminate on individual segment
  finished=True when processing streaming input
- Fix mm_options null check in Qwen3ASRDummyInputsBuilder

Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
@oktrained

Copy link
Copy Markdown

Hello, when will I be able to test the realtime transcription? The state-maintained approach looks like a good one, as it matches the implementation used in the Qwen realtime demo. I also observed the same thing as you the current realtime endpoint stateless does not produce correct output.

@TheCodeWrangler

Copy link
Copy Markdown
Contributor Author

Hello, when will I be able to test the realtime transcription? The state-maintained approach looks like a good one, as it matches the implementation used in the Qwen realtime demo. I also observed the same thing as you the current realtime endpoint stateless does not produce correct output.

The implementation from qwen directly is just monotonically increasing the audio window sent to ASR and then on time window T+1 it uses a prefix of the output of T but strips the last 5 tokens.

This will not work for realtime in use cases where audio can be arbitrarily long. For that reason we cannot just package the implementation from the qwen3 asr demo.

A custom wrapping implementation would be needed, but I think the vllm community needs broader input into how that should be managed within vllm.

@lifeiteng

Copy link
Copy Markdown

@TheCodeWrangler

Could you provide a quick update on the current status? Specifically, are there any remaining blockers or specific areas where the community could help with testing?

Looking forward to seeing this merged!

Resolve conflicts after main moved realtime websocket code from
vllm/entrypoints/openai/realtime/* to
vllm/entrypoints/speech_to_text/realtime/* and changed two APIs:

- vllm.inputs.data.PromptType -> vllm.inputs.PromptType (already
  re-exported); adopted the shorter import.
- _check_model() now returns an ErrorResponse whose message lives at
  .error.message; adopted main's safer handling that also rejects
  missing-model session.update events explicitly.

Kept this PR's tok_params plumbing through render_cmpl_async so the
realtime websocket respects max_total_tokens / max_output_tokens.

Also widen SupportsRealtime.buffer_realtime_audio with **kwargs: Any
so subclasses (Qwen3-ASR realtime, etc.) can extend the streaming
control surface with model-specific options like prefix_texts,
rollback_tokens, segment_duration_s, etc. without breaking the
Protocol. The dispatcher in realtime/serving.py already forwards
these as kwargs.

Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@TheCodeWrangler

Copy link
Copy Markdown
Contributor Author

Closing this draft to focus attention on the smaller-scoped Qwen3-ASR work in #35415 and #36018. The cross-segment streaming approach here still seems right to me and the rebased code lives on TheCodeWrangler/vllm:fix/qwen3-asr-realtime-improvements if anyone wants to pick it up. Thanks @lifeiteng @oktrained for the interest along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend qwen Related to Qwen models v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants