[Feature] Add `response_prefix` parameter to audio transcription/translation endpoints by TheCodeWrangler · Pull Request #36018 · vllm-project/vllm

TheCodeWrangler · 2026-03-04T15:48:20Z

Summary

Adds an optional response_prefix field to both /v1/audio/transcriptions and /v1/audio/translations endpoints
When provided, the text is injected into the assistant turn of the prompt so the model continues generation from that point — as if it had already produced those words
Enables callers to implement streaming-quality transcription over REST without WebSocket infrastructure

This implements Option A from RFC #35908.

How it works

For Qwen3-ASR, response_prefix is appended after the <asr_text> tag in the assistant turn, matching the prompt shape used by the Qwen3-ASR SDK's streaming_transcribe:

<|im_start|>assistant
language English<asr_text>{response_prefix}
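
As a rough illustration of the prompt shape above, the open assistant turn could be assembled as follows. The helper name `build_assistant_prefix` and its parameters are hypothetical and are not taken from `qwen3_asr.py`; the sketch also assumes the prefix has already been sanitized.

```python
def build_assistant_prefix(language: str, response_prefix: str = "") -> str:
    """Return the open assistant turn that the model continues from.

    Hypothetical helper that mirrors the prompt shape shown above; it is
    not the actual vLLM implementation. Assumes `response_prefix` has
    already been sanitized.
    """
    # The turn is left open (no <|im_end|>) so that generation resumes
    # immediately after the caller-supplied prefix.
    return f"<|im_start|>assistant\nlanguage {language}<asr_text>{response_prefix}"
```

With an empty `response_prefix` the result ends at the `<asr_text>` tag, which matches the baseline (no prefix) request.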

A caller implements the streaming loop as:

1. POST audio[0:2s], response_prefix=""               → T1
2. POST audio[0:4s], response_prefix=rollback(T1)     → T2
3. POST audio[0:6s], response_prefix=rollback(T2)     → T3
...

The caller handles audio accumulation, rollback, and prefix capping — the server just runs a single transcription with the given prompt.
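
The caller-side loop described above might look roughly like the sketch below. Everything here is an assumption for illustration: `rollback` drops a fixed number of trailing words, `stream_transcribe` is a hypothetical name, and the `transcribe` callable stands in for a POST to `/v1/audio/transcriptions` that returns only the text generated after the prefix. Prefix capping is omitted for brevity.

```python
from typing import Callable, Iterable, Iterator


def rollback(text: str, n_words: int = 2) -> str:
    """Drop the last n_words words, which are the most likely to be revised."""
    words = text.split()
    if n_words <= 0:
        return " ".join(words)
    return " ".join(words[:-n_words])


def stream_transcribe(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes, str], str],
    n_rollback: int = 2,
) -> Iterator[str]:
    """Yield a growing transcript, one entry per audio chunk.

    `transcribe(audio, response_prefix)` is a hypothetical stand-in for a
    single stateless request to the server.
    """
    audio = b""
    prefix = ""
    for chunk in chunks:
        audio += chunk  # the caller accumulates audio; the server keeps no state
        continuation = transcribe(audio, prefix)
        # Assumption: the response contains only the continuation, so the
        # full transcript is the prefix plus the new text.
        transcript = f"{prefix} {continuation.strip()}".strip()
        yield transcript
        # Keep only the stable part as the prefix for the next request.
        prefix = rollback(transcript, n_rollback)
```

Joining prefix and continuation with a single space is a simplification; a real client would need to handle a continuation that starts mid-word.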

Changes

| File | Change |
| --- | --- |
| `vllm/config/speech_to_text.py` | Add `response_prefix: str = ""` to `SpeechToTextParams` |
| `vllm/entrypoints/speech_to_text/transcription/protocol.py` | Add `response_prefix` to `TranscriptionRequest` and pass it through |
| `vllm/entrypoints/speech_to_text/translation/protocol.py` | Add `response_prefix` to `TranslationRequest` and pass it through |
| `vllm/model_executor/models/qwen3_asr.py` | Sanitize `response_prefix` and append after `<asr_text>` tag in the assistant turn |

response_prefix rides on the existing SpeechToTextParams dataclass plumbing (the refactor from #36268), so no SupportsTranscription interface change is needed and other ASR backends pick up the new field as a no-op default "" automatically.

Companion PR

This sits on top of the same Qwen3-ASR prompt-handling surface as #35415 (prompt / request_prompt). The prompt sanitizer (_sanitize_transcription_user_text) is shared between the two — response_prefix is sanitized identically so a malicious prefix cannot escape the assistant role and inject control tokens. After #35415 lands, this PR rebases to a much smaller diff.

Related

RFC: [RFC]: Model-specific realtime streaming abstraction #35908
Original streaming issue: [Enhancement]: Qwen3-ASR realtime endpoint produces degraded output — stateless segments, no cross-segment context, raw format leaks #35767
The "streaming" examples in the initial Qwen ASR release rely heavily on prefixing the prompt with the transcript so far as audio accumulates [ref]. This parameter allows a similar implementation when hosting in vLLM.

Test plan

Pre-commit hooks pass (ruff, mypy, typos, etc.)
DCO sign-offs match author email on every commit
Built Docker dev image with patched files and started Qwen3-ASR-1.7B server
Verified baseline transcription still works (no response_prefix)
Verified response_prefix is accepted and changes model output
Sanitizer test added on companion PR feat(qwen3-asr): support prompt parameter in v1/audio/transcriptions #35415 covers the strip path used here too
Upstream CI (pre-run-check is the label gate, not a code failure)

Notes

AI assistance (Cursor) was used; every line was reviewed and tested by the human submitter before push.

gemini-code-assist

Code Review

This pull request introduces a response_prefix parameter to the audio transcription and translation endpoints, enabling callers to guide the model's generation and implement streaming-like functionality. The changes are well-structured, updating the API protocol, plumbing the new parameter through the serving layer, and modifying the SupportsTranscription interface. The core logic is implemented for the Qwen3-ASR model, while other models are updated for interface compatibility. Additionally, the request_prompt is now correctly used as a system message for Qwen3-ASR. However, the implementation in the Qwen3-ASR model is vulnerable to prompt injection as it directly concatenates user-supplied strings into a structured prompt without sanitizing model-specific control tokens. This could allow an attacker to manipulate the model's behavior by injecting arbitrary turns into the prompt.

TheCodeWrangler · 2026-03-04T16:00:45Z

Tested locally.

The results looked the way I expected. The response picks up where the response prefix ends (even when the response prefix starts with words spoken before the audio segment).

The prompt guided the model to the correct transcription of "phonograph".

curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@mary_had_lamb.ogg" \
  -F "model=Qwen/Qwen3-ASR-1.7B" \
  -F "language=en" \
  -F "response_prefix=A bunch of text that may have occurred prior to this audio starting. One of the first words" \
  -F "prompt=A phonograph plays audio" | python3 -m json.tool

{
    "text": " I spoke in the original phonograph, a little piece of practical poetry. Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.",
    "usage": {
        "type": "duration",
        "seconds": 16
    }
}

mergify · 2026-03-05T02:03:26Z

Hi @TheCodeWrangler, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

mergify · 2026-03-12T12:58:26Z

Hi @TheCodeWrangler, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…slation endpoints

Adds an optional `response_prefix` field to both `/v1/audio/transcriptions` and `/v1/audio/translations`. When provided, the text is injected into the assistant turn of the prompt so the model continues generation from that point, as if it had already produced those words.

This enables callers to implement streaming-quality transcription over the existing REST API without requiring WebSocket infrastructure: send the growing audio buffer with the stable prefix from the previous segment (minus rollback tokens), and the model picks up where it left off.

For Qwen3-ASR, the prefix is appended after the `<asr_text>` tag in the assistant turn, matching the prompt shape used by the Qwen3-ASR SDK's `streaming_transcribe`. The `request_prompt` field is also now wired as a system message in the ChatML template.

All other ASR model implementations accept the new parameter as a no-op with a default of `""`, so this is fully backward-compatible.

Implements Option A from RFC vllm-project#35908.

Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>

Two fixes from PR vllm-project#35415:

1. The language hint in the assistant prefix now correctly uses `language` (source audio language) for transcription and `to_language` (target language) for translation. Previously only `to_language` was checked, so plain transcription with an explicit language never passed the hint to the model.
2. Guard `mm_options` with `or {}` to prevent AttributeError when it is None during engine initialization / encoder cache profiling.

Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor

Replace per-call token list with a regex that strips any <|...|> fragments from prompt and response_prefix, addressing prompt-injection review feedback for the transcription API. Signed-off-by: Nathan Price <nathan@abridge.com> Made-with: Cursor

TheCodeWrangler · 2026-04-15T14:47:49Z

Update (prompt injection): prompt and response_prefix are sanitized in qwen3_asr.py via _sanitize_transcription_user_text(), which strips any <|...|>-shaped ChatML fragments and the <asr_text> tag before embedding user text in the system or assistant turn. Latest push tightens this to a regex over all such tokens rather than a fixed short list. Same trust model as other user-controlled OpenAI API strings.

Resolve conflict in speech-to-text protocol files after main split the combined `openai/speech_to_text/protocol.py` into `speech_to_text/transcription/protocol.py` and `speech_to_text/translation/protocol.py`. Re-applied this PR's `response_prefix` field and `build_stt_params` plumbing on both new locations, keeping the `SpeechToTextParams.response_prefix` addition from this branch.

Also fix two issues that would have failed CI / runtime after the merge:

- `import re` is forbidden in vLLM; switched to `import regex as re`.
- `<|redacted_im_end|>` typo in the system-turn template (introduced in 8a8fd28 alongside the regex sanitizer) restored to `<|im_end|>` to match the rest of the prompt and the Qwen3-ASR SDK format.

Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

…response-prefix

…35415

Mirrors the Qwen3-ASR cleanups landing on `qwen-asr-prompt-support` so this PR stays consistent with the underlying prompt plumbing it extends:

- _sanitize_transcription_user_text: apply the regex to a fixpoint so nested tokens like `<|im<|x|>_end|>` cannot reconstruct a valid ChatML token after a single pass. Used by both `prompt` and the new `response_prefix` field.
- Sanitize `request_prompt` first, then check truthiness, so a fully-stripped prompt does not produce an empty `<|im_start|>system\n<|im_end|>\n` turn.
- Revert the unnecessary `(mm_options or {}).get(...)` guard. The base annotation is `Mapping[str, BaseDummyOptions]` and every other model trusts it.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nathan Price <nathan@abridge.com>

Single-pass `str.replace("<asr_text>", "")` is bypassable via nested payloads such as `<asr_te<asr_text>xt>` which reconstructs to `<asr_text>` after the inner tag is stripped, allowing the model-significant assistant-prefix delimiter to be injected into the system / assistant turns.

Move both the ChatML-token substitution and the `<asr_text>` removal inside the fixpoint loop so nested payloads cannot reconstruct either kind of token.

Signed-off-by: Nathan Price <nathan@abridge.com>

Use `context` / `prefix` for the sanitized fields (matching vllm-project#35415's naming for the request_prompt path). Drop the redundant inline comment that duplicated the `_sanitize_transcription_user_text` docstring. Signed-off-by: Nathan Price <nathan@abridge.com>

…response-prefix Signed-off-by: Nathan Price <nathan@abridge.com> # Conflicts: # vllm/model_executor/models/qwen3_asr.py

DarkLight1337 · 2026-06-11T12:20:13Z

@Isotr0py @NickLucche what do you think of this? I haven't used audio models much so couldn't comment on the use case

TheCodeWrangler · 2026-06-13T16:05:40Z

> @Isotr0py @NickLucche what do you think of this? I haven't used audio models much so couldn't comment on the use case

Thanks for pinging the right reviewers!

TheCodeWrangler requested review from NickLucche, patrickvonplaten and sighingnow as code owners March 4, 2026 15:48

mergify Bot added frontend qwen Related to Qwen models labels Mar 4, 2026

TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from 3167f9b to a1c5918 Compare March 4, 2026 15:50

gemini-code-assist Bot reviewed Mar 4, 2026

Comment thread vllm/model_executor/models/qwen3_asr.py Outdated

TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from a1c5918 to c2f1670 Compare March 4, 2026 15:57

sc-hua mentioned this pull request Mar 19, 2026

[New Model]: SoulX-Duplug-0.6B (Soul-AILab/SoulX-Duplug-0.6B) vllm-project/vllm-omni#1967

Open

1 task

TheCodeWrangler requested a review from vadiklyutiy as a code owner March 31, 2026 14:42

TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from 5a6fd3b to 4064588 Compare April 2, 2026 20:36

TheCodeWrangler and others added 5 commits April 6, 2026 13:57

Merge branch 'main' into feat/transcription-response-prefix

a12a217

Merge branch 'main' into feat/transcription-response-prefix

0b51d4d

Merge branch 'main' into feat/transcription-response-prefix

f39d721

Merge branch 'main' into feat/transcription-response-prefix

429ed69

TheCodeWrangler and others added 3 commits April 12, 2026 15:15

Merge branch 'main' into feat/transcription-response-prefix

8a93761

Merge upstream/main into feat/transcription-response-prefix

e61387f

TheCodeWrangler requested review from ProExpertProg, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners April 29, 2026 13:59

Merge branch 'main' into feat/transcription-response-prefix

70f3af9

BWAAEEEK mentioned this pull request May 13, 2026

[Bugfix] Fix Qwen3-ASR transcription streaming postprocessing #42478

Open

This was referenced May 19, 2026

feat(qwen3-asr): support prompt parameter in v1/audio/transcriptions #35415

Merged

[Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context #35894

Closed

Merge remote-tracking branch 'upstream/main' into feat/transcription-…

00af329

…response-prefix

depthfirst-app Bot reviewed May 29, 2026

Comment thread vllm/model_executor/models/qwen3_asr.py Outdated

TheCodeWrangler and others added 2 commits May 29, 2026 13:20

TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from 2c302e0 to 0ea4b42 Compare May 29, 2026 14:01

TheCodeWrangler added 2 commits May 29, 2026 14:12

Merge upstream/main into feat/transcription-response-prefix

d1f10f6

depthfirst-app Bot reviewed Jun 1, 2026

Comment thread vllm/model_executor/models/qwen3_asr.py

Merge remote-tracking branch 'upstream/main' into feat/transcription-…

9d5812a

…response-prefix Signed-off-by: Nathan Price <nathan@abridge.com> # Conflicts: # vllm/model_executor/models/qwen3_asr.py

TheCodeWrangler added 3 commits June 11, 2026 14:23

Merge branch 'main' into feat/transcription-response-prefix

47c5d05

Merge branch 'main' into feat/transcription-response-prefix

06d0a7d

Merge branch 'main' into feat/transcription-response-prefix

4028e0e
