[Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context#35894
[Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context#35894TheCodeWrangler wants to merge 2 commits into
Conversation
c5b2072 to
f9ccdb2
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a significant improvement to the Qwen3-ASR realtime streaming functionality by adopting the SDK-style cross-segment context approach, which should substantially enhance transcription quality. However, a high-severity prompt injection vulnerability was identified in the prompt construction logic, as untrusted user input from the session configuration is directly embedded into the model's prompt without sanitization, allowing for potential manipulation of the model's behavior via ChatML delimiters. Furthermore, the review also identified high-severity issues related to error handling that could mask bugs and an inconsistent default value for segment duration. Addressing these points will be crucial for the robustness, correctness, and security of this new implementation.
f9ccdb2 to
917b866
Compare
…ntext The Qwen3-ASR realtime endpoint previously transcribed each audio segment in isolation, producing degraded output with repeated content, raw format leaks, and no cross-segment context. This rewrites the realtime streaming logic to match the official Qwen3-ASR SDK approach: - Wire up input_stream feedback loop so generated tokens flow back into the next segment's prompt as a text prefix - Implement prefix rollback (configurable rollback_tokens) so the model can re-decide word boundaries at segment junctions - Apply holdback on the connection side to only send stable text as deltas, flushing the remainder at stream end - Add RealtimeSessionConfig with per-session tuning parameters (segment_duration_s, rollback_tokens, unfixed_chunks, max_prefix_tokens, max_audio_s, realtime_max_tokens) configurable via session.update - Fix AsyncLLM.generate() to not terminate on individual segment finished=True when processing streaming input - Fix mm_options null check in Qwen3ASRDummyInputsBuilder Made-with: Cursor Signed-off-by: Nathan Price <nathan@abridge.com> Made-with: Cursor Signed-off-by: Nathan Price <nathan@abridge.com> Made-with: Cursor Signed-off-by: Nathan Price <nathan@abridge.com> Made-with: Cursor
917b866 to
af37f91
Compare
|
Hello, when will I be able to test the realtime transcription? The state-maintained approach looks like a good one, as it matches the implementation used in the Qwen realtime demo. I also observed the same thing as you the current realtime endpoint stateless does not produce correct output. |
The implementation from qwen directly is just monotonically increasing the audio window sent to ASR and then on time window T+1 it uses a prefix of the output of T but strips the last 5 tokens. This will not work for realtime in use cases where audio can be arbitrarily long. For that reason we cannot just package the implementation from the qwen3 asr demo. A custom wrapping implementation would be needed, but I think the vllm community needs broader input into how that should be managed within vllm. |
|
Could you provide a quick update on the current status? Specifically, are there any remaining blockers or specific areas where the community could help with testing? Looking forward to seeing this merged! |
Resolve conflicts after main moved realtime websocket code from vllm/entrypoints/openai/realtime/* to vllm/entrypoints/speech_to_text/realtime/* and changed two APIs: - vllm.inputs.data.PromptType -> vllm.inputs.PromptType (already re-exported); adopted the shorter import. - _check_model() now returns an ErrorResponse whose message lives at .error.message; adopted main's safer handling that also rejects missing-model session.update events explicitly. Kept this PR's tok_params plumbing through render_cmpl_async so the realtime websocket respects max_total_tokens / max_output_tokens. Also widen SupportsRealtime.buffer_realtime_audio with **kwargs: Any so subclasses (Qwen3-ASR realtime, etc.) can extend the streaming control surface with model-specific options like prefix_texts, rollback_tokens, segment_duration_s, etc. without breaking the Protocol. The dispatcher in realtime/serving.py already forwards these as kwargs. Signed-off-by: Nathan Price <nathan@abridge.com> Co-authored-by: Cursor <cursoragent@cursor.com>
|
Closing this draft to focus attention on the smaller-scoped Qwen3-ASR work in #35415 and #36018. The cross-segment streaming approach here still seems right to me and the rebased code lives on |
Summary
The Qwen3-ASR realtime WebSocket endpoint (
/v1/realtime) currently transcribes each audio segment in complete isolation — no cross-segment context, no output post-processing, and theinput_streamfeedback mechanism accepted but never consumed. This produces significantly degraded transcription compared to the REST batch endpoint, with repeated content at segment boundaries and rawlanguage English<asr_text>format leaking to clients.This PR rewrites the realtime streaming logic to match the official Qwen3-ASR SDK streaming approach, where each inference step sends all accumulated audio plus a text prefix of previously-decoded output (with a small rollback to let the model re-decide boundary tokens).
Addresses #35767.
What changed
Model layer (
qwen3_asr_realtime.py):buffer_realtime_audionow accumulates all audio (not fixed 5s segments) and re-infers over the full buffer each stepinput_streamfeedback loop: generated tokens flow back via_collect_generation→_rollback_prefix→ next segment's prompt prefix_cap_prefix_tokensprevents unbounded prefix growthConnection layer (
connection.py):transcription.delta— the uncertain tail is held back until the next segment confirms it or stream end flushes itprompt_token_idslength changes and EOS tokensSession configuration (
protocol.py,serving.py):RealtimeSessionConfigdataclass with per-session tunable parameterssegment_duration_s,rollback_tokens,unfixed_chunks,max_prefix_tokens,max_audio_s, andrealtime_max_tokensviasession.updateEngine fix (
async_llm.py):AsyncLLM.generate()no longer terminates the overall stream when an individual segment finishes withfinished=True— only theSTREAM_FINISHEDsentinel ends the generation loopBug fix (
qwen3_asr.py):mm_options.get("audio")againstNoneto preventAttributeErrorwhenmm_optionsis not providedStreaming architecture
Session configuration parameters
All parameters are optional and have sensible defaults. They can be set per-session via
session.update:segment_duration_s(default: 2.0) — seconds of new audio to accumulate before triggering re-inferencerollback_tokens(default: 5) — tokens dropped from the end of the previous transcription before it becomes the next segment's prefix; higher values give the model more room to revise boundaries but increase latencyunfixed_chunks(default: 2) — initial segments that skip prefix rollback entirely (the model has too little context to benefit from it)max_prefix_tokens(default: 1024) — hard cap on prefix length in tokens to bound prompt sizemax_audio_s(default: 300) — max accumulated audio duration before trimming oldest samplesrealtime_max_tokens(default: 128) — max generation tokens per segmentlanguage— ISO-639-1 code (e.g."en") for thelanguage {Lang}<asr_text>prompt prefixprompt— free-text context hint injected as a system message (e.g."user talking about their org/organization")Test plan
session.updateparameters propagate through and affect streaming behavior/v1/audio/transcriptionsendpoint is unaffectedmm_optionsnull-check fix doesn't regress dummy input generationMade with Cursor