[Feature] Add response_prefix parameter to audio transcription/translation endpoints#36018

Open
TheCodeWrangler wants to merge 23 commits into vllm-project:main from TheCodeWrangler:feat/transcription-response-prefix

Conversation

@TheCodeWrangler

@TheCodeWrangler TheCodeWrangler commented Mar 4, 2026

Contributor

Summary

  • Adds an optional response_prefix field to both /v1/audio/transcriptions and /v1/audio/translations endpoints
  • When provided, the text is injected into the assistant turn of the prompt so the model continues generation from that point — as if it had already produced those words
  • Enables callers to implement streaming-quality transcription over REST without WebSocket infrastructure

This implements Option A from RFC #35908.

How it works

For Qwen3-ASR, response_prefix is appended after the <asr_text> tag in the assistant turn, matching the prompt shape used by the Qwen3-ASR SDK's streaming_transcribe:

<|im_start|>assistant
language English<asr_text>{response_prefix}
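
A minimal sketch of how that assistant turn could be assembled, for illustration only. The helper name `build_assistant_prefix` is hypothetical and is not taken from the vLLM source; it also assumes the prefix has already been sanitized.

```python
def build_assistant_prefix(language: str, response_prefix: str = "") -> str:
    """Assemble the open assistant turn that the model continues from.

    Hypothetical helper: mirrors the prompt shape shown above, nothing more.
    The turn is deliberately left unterminated (no <|im_end|>) so generation
    resumes right after the prefix text.
    """
    return f"<|im_start|>assistant\nlanguage {language}<asr_text>{response_prefix}"


# With an empty prefix the prompt ends at the <asr_text> tag, which is the
# baseline (non-streaming) behaviour.
baseline = build_assistant_prefix("English")
continued = build_assistant_prefix("English", "Mary had a")
```

Because the default is the empty string, a request without `response_prefix` yields the same prompt as before the change.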

A caller implements the streaming loop as:

1. POST audio[0:2s], response_prefix=""               → T1
2. POST audio[0:4s], response_prefix=rollback(T1)     → T2
3. POST audio[0:6s], response_prefix=rollback(T2)     → T3
...

The caller handles audio accumulation, rollback, and prefix capping — the server just runs a single transcription with the given prompt.
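
The caller-side loop above can be sketched roughly as follows. This is an assumption-laden illustration, not part of the PR: `rollback` here simply drops the last few words, the prefix cap is a character limit, and `transcribe` is a hypothetical callable standing in for the POST to `/v1/audio/transcriptions` that returns only the text generated after the prefix.

```python
from typing import Callable, Iterable


def rollback(text: str, n_words: int = 2) -> str:
    """Drop the last n_words words, assumed to be the least stable ones."""
    words = text.split()
    keep = max(len(words) - n_words, 0)
    return " ".join(words[:keep])


def stream_transcribe(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes, str], str],  # hypothetical: (audio, response_prefix) -> continuation
    n_rollback: int = 2,
    max_prefix_chars: int = 500,
) -> str:
    """Re-send the growing audio buffer with the rolled-back transcript as prefix."""
    audio = b""
    transcript = ""
    for chunk in chunks:
        audio += chunk  # audio always accumulates from the start
        stable = rollback(transcript, n_rollback)
        # Cap what is sent to the server; the full stable text is kept locally.
        prefix = stable[-max_prefix_chars:] if max_prefix_chars > 0 else ""
        continuation = transcribe(audio, prefix)
        transcript = f"{stable} {continuation.strip()}".strip()
    return transcript
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library; the rollback and capping policies are the caller's choice.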

Changes

| File | Change |
| --- | --- |
| vllm/config/speech_to_text.py | Add response_prefix: str = "" to SpeechToTextParams |
| vllm/entrypoints/speech_to_text/transcription/protocol.py | Add response_prefix to TranscriptionRequest and pass it through |
| vllm/entrypoints/speech_to_text/translation/protocol.py | Add response_prefix to TranslationRequest and pass it through |
| vllm/model_executor/models/qwen3_asr.py | Sanitize response_prefix and append after <asr_text> tag in the assistant turn |

response_prefix rides on the existing SpeechToTextParams dataclass plumbing (the refactor from #36268), so no SupportsTranscription interface change is needed and other ASR backends pick up the new field as a no-op default "" automatically.
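
To illustrate why a defaulted dataclass field needs no interface change, here is a simplified stand-in. Only `response_prefix: str = ""` comes from the change table; the `language` field and both backend functions are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechToTextParams:
    """Simplified stand-in for the real params object, which carries more fields."""

    language: Optional[str] = None  # illustrative field, assumed for this sketch
    response_prefix: str = ""  # new field; the empty default preserves old behaviour


def backend_without_prefix_support(params: SpeechToTextParams) -> str:
    """Hypothetical backend that never reads response_prefix, so it is a no-op."""
    return f"<transcribe lang={params.language}>"


def backend_with_prefix_support(params: SpeechToTextParams) -> str:
    """Hypothetical backend that continues generation from response_prefix."""
    return f"<transcribe lang={params.language}>{params.response_prefix}"
```

Existing call sites that construct the params object without the new field keep working, since the default is supplied by the dataclass.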

Companion PR

This sits on top of the same Qwen3-ASR prompt-handling surface as #35415 (prompt / request_prompt). The prompt sanitizer (_sanitize_transcription_user_text) is shared between the two — response_prefix is sanitized identically so a malicious prefix cannot escape the assistant role and inject control tokens. After #35415 lands, this PR rebases to a much smaller diff.

Related

Test plan

  • Pre-commit hooks pass (ruff, mypy, typos, etc.)
  • DCO sign-offs match author email on every commit
  • Built Docker dev image with patched files and started Qwen3-ASR-1.7B server
  • Verified baseline transcription still works (no response_prefix)
  • Verified response_prefix is accepted and changes model output
  • The sanitizer test added on companion PR #35415 (feat(qwen3-asr): support prompt parameter in v1/audio/transcriptions) also covers the strip path used here
  • Upstream CI (pre-run-check is the label gate, not a code failure)

Notes

  • AI assistance (Cursor) was used; every line was reviewed and tested by the human submitter before push.

@mergify mergify Bot added the frontend and qwen (Related to Qwen models) labels Mar 4, 2026
@TheCodeWrangler TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from 3167f9b to a1c5918 Compare March 4, 2026 15:50

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a response_prefix parameter to the audio transcription and translation endpoints, enabling callers to guide the model's generation and implement streaming-like functionality. The changes are well-structured, updating the API protocol, plumbing the new parameter through the serving layer, and modifying the SupportsTranscription interface. The core logic is implemented for the Qwen3-ASR model, while other models are updated for interface compatibility. Additionally, the request_prompt is now correctly used as a system message for Qwen3-ASR. However, the implementation in the Qwen3-ASR model is vulnerable to prompt injection as it directly concatenates user-supplied strings into a structured prompt without sanitizing model-specific control tokens. This could allow an attacker to manipulate the model's behavior by injecting arbitrary turns into the prompt.

Comment thread vllm/model_executor/models/qwen3_asr.py Outdated
@TheCodeWrangler TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from a1c5918 to c2f1670 Compare March 4, 2026 15:57
@TheCodeWrangler

TheCodeWrangler commented Mar 4, 2026

Contributor Author

Tested locally.

The results looked the way I expected. The response picks up from where the response prefix finishes (even if the response prefix started with words prior to the audio segment).

The prompt guided the model to the corrected transcription of "phonograph".

curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@mary_had_lamb.ogg" \
  -F "model=Qwen/Qwen3-ASR-1.7B" \
  -F "language=en" \
  -F "response_prefix=A bunch of text that may have occurred prior to this audio starting. One of the first words" \
  -F "prompt=A phonograph plays audio" | python3 -m json.tool

{
    "text": " I spoke in the original phonograph, a little piece of practical poetry. Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.",
    "usage": {
        "type": "duration",
        "seconds": 16
    }
}

@mergify

mergify Bot commented Mar 5, 2026

Contributor

Hi @TheCodeWrangler, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify

mergify Bot commented Mar 12, 2026

Contributor

Hi @TheCodeWrangler, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…slation endpoints

Adds an optional `response_prefix` field to both `/v1/audio/transcriptions`
and `/v1/audio/translations`. When provided, the text is injected into the
assistant turn of the prompt so the model continues generation from that
point, as if it had already produced those words.

This enables callers to implement streaming-quality transcription over the
existing REST API without requiring WebSocket infrastructure: send the
growing audio buffer with the stable prefix from the previous segment
(minus rollback tokens), and the model picks up where it left off.

For Qwen3-ASR, the prefix is appended after the `<asr_text>` tag in the
assistant turn, matching the prompt shape used by the Qwen3-ASR SDK's
`streaming_transcribe`. The `request_prompt` field is also now wired as a
system message in the ChatML template.

All other ASR model implementations accept the new parameter as a no-op
with a default of `""`, so this is fully backward-compatible.

Implements Option A from RFC vllm-project#35908.

Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
@TheCodeWrangler TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from 5a6fd3b to 4064588 Compare April 2, 2026 20:36
TheCodeWrangler and others added 5 commits April 6, 2026 13:57
Two fixes from PR vllm-project#35415:

1. The language hint in the assistant prefix now correctly uses
   `language` (source audio language) for transcription and
   `to_language` (target language) for translation. Previously only
   `to_language` was checked, so plain transcription with an explicit
   language never passed the hint to the model.

2. Guard `mm_options` with `or {}` to prevent AttributeError when
   it is None during engine initialization / encoder cache profiling.

Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
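
The first fix in the commit message above amounts to choosing the hint by task. A minimal sketch, assuming the task is identified by the strings "transcribe" and "translate"; the function name is hypothetical.

```python
from typing import Optional


def pick_language_hint(
    task_type: str,
    language: Optional[str],
    to_language: Optional[str],
) -> Optional[str]:
    """Choose which language goes into the assistant-prefix hint.

    Assumed task names: "transcribe" uses the source audio language,
    "translate" uses the target language.
    """
    if task_type == "translate":
        return to_language
    return language
```

Checking only `to_language`, as the earlier code did, would return no hint for a plain transcription request that sets `language` alone.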
TheCodeWrangler and others added 3 commits April 12, 2026 15:15
Replace per-call token list with a regex that strips any <|...|> fragments
from prompt and response_prefix, addressing prompt-injection review feedback
for the transcription API.

Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
@TheCodeWrangler

Contributor Author

Update (prompt injection): prompt and response_prefix are sanitized in qwen3_asr.py via _sanitize_transcription_user_text(), which strips any <|...|>-shaped ChatML fragments and the <asr_text> tag before embedding user text in the system or assistant turn. Latest push tightens this to a regex over all such tokens rather than a fixed short list. Same trust model as other user-controlled OpenAI API strings.
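
A rough sketch of what a regex-based, single-pass strip of this kind might look like. The pattern and the name `sanitize_once` are assumptions made for illustration; the actual `_sanitize_transcription_user_text` in the PR may differ.

```python
# Standard-library re keeps this sketch dependency-free; the pattern below is
# an assumed approximation of "<|...|>-shaped" tokens, not the PR's regex.
import re

_CHATML_TOKEN = re.compile(r"<\|[^<>|]*\|>")


def sanitize_once(text: str) -> str:
    """Remove ChatML-style control tokens and the <asr_text> tag in one pass."""
    text = _CHATML_TOKEN.sub("", text)
    return text.replace("<asr_text>", "")
```

For example, a prefix that tries to close the assistant turn and open a system turn loses both control tokens and is left as ordinary text.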

Resolve conflict in speech-to-text protocol files after main split
the combined `openai/speech_to_text/protocol.py` into
`speech_to_text/transcription/protocol.py` and
`speech_to_text/translation/protocol.py`. Re-applied this PR's
`response_prefix` field and `build_stt_params` plumbing on both new
locations, keeping the `SpeechToTextParams.response_prefix` addition
from this branch.

Also fix two issues that would have failed CI / runtime after the
merge:

- `import re` is forbidden in vLLM; switched to `import regex as re`.
- `<|redacted_im_end|>` typo in the system-turn template (introduced
  in 8a8fd28 alongside the regex sanitizer) restored to `<|im_end|>`
  to match the rest of the prompt and the Qwen3-ASR SDK format.

Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread vllm/model_executor/models/qwen3_asr.py Outdated
TheCodeWrangler and others added 2 commits May 29, 2026 13:20
…35415

Mirrors the Qwen3-ASR cleanups landing on `qwen-asr-prompt-support` so
this PR stays consistent with the underlying prompt plumbing it
extends:

- _sanitize_transcription_user_text: apply the regex to a fixpoint so
  nested tokens like `<|im<|x|>_end|>` cannot reconstruct a valid
  ChatML token after a single pass. Used by both `prompt` and the new
  `response_prefix` field.
- Sanitize `request_prompt` first, then check truthiness, so a fully-
  stripped prompt does not produce an empty
  `<|im_start|>system\n<|im_end|>\n` turn.
- Revert the unnecessary `(mm_options or {}).get(...)` guard. The base
  annotation is `Mapping[str, BaseDummyOptions]` and every other model
  trusts it.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nathan Price <nathan@abridge.com>
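
The "sanitize first, then check truthiness" ordering from the commit message above can be illustrated with a small sketch. The regex and the function name `build_system_turn` are assumptions, not the PR's code; the variable name `context` follows the naming mentioned later in this thread.

```python
import re

# Assumed approximation of the control-token pattern.
_CHATML_TOKEN = re.compile(r"<\|[^<>|]*\|>")


def build_system_turn(request_prompt: str) -> str:
    """Return a ChatML system turn, or "" when nothing survives sanitization."""
    context = _CHATML_TOKEN.sub("", request_prompt)
    # Test the sanitized text, not the raw input: a prompt made only of
    # control tokens is truthy before stripping and empty afterwards.
    if not context:
        return ""
    return f"<|im_start|>system\n{context}<|im_end|>\n"
```

Checking the raw `request_prompt` instead would emit `<|im_start|>system\n<|im_end|>\n` for an input such as `<|im_end|>`.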
Single-pass `str.replace("<asr_text>", "")` is bypassable via nested
payloads such as `<asr_te<asr_text>xt>` which reconstructs to
`<asr_text>` after the inner tag is stripped, allowing the
model-significant assistant-prefix delimiter to be injected into the
system / assistant turns. Move both the ChatML-token substitution and
the `<asr_text>` removal inside the fixpoint loop so nested payloads
cannot reconstruct either kind of token.

Signed-off-by: Nathan Price <nathan@abridge.com>
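
To make the nested-payload bypass concrete, here is a sketch comparing a single pass with a fixpoint loop. The regex and both function names are assumptions for illustration and are not copied from the PR.

```python
import re

# Assumed approximation of the control-token pattern.
_CHATML_TOKEN = re.compile(r"<\|[^<>|]*\|>")


def single_pass(text: str) -> str:
    """Strip control tokens and the <asr_text> tag exactly once."""
    return _CHATML_TOKEN.sub("", text).replace("<asr_text>", "")


def sanitize_fixpoint(text: str) -> str:
    """Repeat the strip until the text stops changing.

    Every pass that changes the text makes it strictly shorter, so the loop
    always terminates.
    """
    while True:
        cleaned = single_pass(text)
        if cleaned == text:
            return cleaned
        text = cleaned
```

With these definitions, a single pass over `<asr_te<asr_text>xt>` leaves `<asr_text>` behind and a single pass over `<|im<|x|>_end|>` leaves `<|im_end|>`, while the fixpoint version reduces both to the empty string.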
@TheCodeWrangler TheCodeWrangler force-pushed the feat/transcription-response-prefix branch from 2c302e0 to 0ea4b42 Compare May 29, 2026 14:01
Use `context` / `prefix` for the sanitized fields (matching vllm-project#35415's
naming for the request_prompt path). Drop the redundant inline comment
that duplicated the `_sanitize_transcription_user_text` docstring.

Signed-off-by: Nathan Price <nathan@abridge.com>
Comment thread vllm/model_executor/models/qwen3_asr.py
…response-prefix

Signed-off-by: Nathan Price <nathan@abridge.com>

# Conflicts:
#	vllm/model_executor/models/qwen3_asr.py
@DarkLight1337

Member

@Isotr0py @NickLucche what do you think of this? I haven't used audio models much so couldn't comment on the use case

@TheCodeWrangler

Contributor Author

> @Isotr0py @NickLucche what do you think of this? I haven't used audio models much so couldn't comment on the use case

Thanks for pinging the right reviewers!
