[Feature] Add response_prefix parameter to audio transcription/translation endpoints #36018
TheCodeWrangler wants to merge 23 commits
3167f9b to a1c5918
Code Review
This pull request introduces a response_prefix parameter to the audio transcription and translation endpoints, enabling callers to guide the model's generation and implement streaming-like functionality. The changes are well-structured, updating the API protocol, plumbing the new parameter through the serving layer, and modifying the SupportsTranscription interface. The core logic is implemented for the Qwen3-ASR model, while other models are updated for interface compatibility. Additionally, the request_prompt is now correctly used as a system message for Qwen3-ASR. However, the implementation in the Qwen3-ASR model is vulnerable to prompt injection as it directly concatenates user-supplied strings into a structured prompt without sanitizing model-specific control tokens. This could allow an attacker to manipulate the model's behavior by injecting arbitrary turns into the prompt.
a1c5918 to c2f1670
Tested locally. Results looked the way I expected. The response picks up from where the response prefix finishes (even if the response prefix started with words prior to the audio segment). The prompt guided the model to the corrected transcription of "phonograph".

    curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
      -F "file=@mary_had_lamb.ogg" \
      -F "model=Qwen/Qwen3-ASR-1.7B" \
      -F "language=en" \
      -F "response_prefix=A bunch of text that may have occurred prior to this audio starting. One of the first words" \
      -F "prompt=A phonograph plays audio" | python3 -m json.tool

    {
        "text": " I spoke in the original phonograph, a little piece of practical poetry. Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.",
        "usage": {
            "type": "duration",
            "seconds": 16
        }
    }
Hi @TheCodeWrangler, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
…slation endpoints

Adds an optional `response_prefix` field to both `/v1/audio/transcriptions` and `/v1/audio/translations`. When provided, the text is injected into the assistant turn of the prompt so the model continues generation from that point, as if it had already produced those words.

This enables callers to implement streaming-quality transcription over the existing REST API without requiring WebSocket infrastructure: send the growing audio buffer with the stable prefix from the previous segment (minus rollback tokens), and the model picks up where it left off.

For Qwen3-ASR, the prefix is appended after the `<asr_text>` tag in the assistant turn, matching the prompt shape used by the Qwen3-ASR SDK's `streaming_transcribe`. The `request_prompt` field is also now wired as a system message in the ChatML template.

All other ASR model implementations accept the new parameter as a no-op with a default of `""`, so this is fully backward-compatible.

Implements Option A from RFC vllm-project#35908.

Made-with: Cursor
Signed-off-by: Nathan Price <nathan@abridge.com>
5a6fd3b to 4064588
Two fixes from PR vllm-project#35415:

1. The language hint in the assistant prefix now correctly uses `language` (source audio language) for transcription and `to_language` (target language) for translation. Previously only `to_language` was checked, so plain transcription with an explicit language never passed the hint to the model.
2. Guard `mm_options` with `or {}` to prevent AttributeError when it is None during engine initialization / encoder cache profiling.

Signed-off-by: Nathan Price <nathan@abridge.com>
Made-with: Cursor
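The first fix can be illustrated with a small before/after sketch. The function names and the `"transcribe"` / `"translate"` task strings are assumptions made for illustration; they are not taken from the PR's diff.

```python
from typing import Optional


def language_hint_before(language: Optional[str], to_language: Optional[str]) -> Optional[str]:
    """Old behavior as described: only the translation target was consulted."""
    return to_language


def language_hint_after(
    task_type: str, language: Optional[str], to_language: Optional[str]
) -> Optional[str]:
    """Sketch of the fixed behavior: pick the hint according to the task."""
    if task_type == "translate":
        # Translation: hint the target language.
        return to_language
    # Transcription: hint the source audio language.
    return language
```

With the old logic, a transcription request that sets only `language` yields no hint at all, which is the symptom the commit message describes.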
Replace per-call token list with a regex that strips any <|...|> fragments from prompt and response_prefix, addressing prompt-injection review feedback for the transcription API. Signed-off-by: Nathan Price <nathan@abridge.com> Made-with: Cursor
Update (prompt injection):
Resolve conflict in speech-to-text protocol files after main split the combined `openai/speech_to_text/protocol.py` into `speech_to_text/transcription/protocol.py` and `speech_to_text/translation/protocol.py`. Re-applied this PR's `response_prefix` field and `build_stt_params` plumbing on both new locations, keeping the `SpeechToTextParams.response_prefix` addition from this branch.

Also fix two issues that would have failed CI / runtime after the merge:

- `import re` is forbidden in vLLM; switched to `import regex as re`.
- `<|redacted_im_end|>` typo in the system-turn template (introduced in 8a8fd28 alongside the regex sanitizer) restored to `<|im_end|>` to match the rest of the prompt and the Qwen3-ASR SDK format.

Signed-off-by: Nathan Price <nathan@abridge.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…35415

Mirrors the Qwen3-ASR cleanups landing on `qwen-asr-prompt-support` so this PR stays consistent with the underlying prompt plumbing it extends:

- `_sanitize_transcription_user_text`: apply the regex to a fixpoint so nested tokens like `<|im<|x|>_end|>` cannot reconstruct a valid ChatML token after a single pass. Used by both `prompt` and the new `response_prefix` field.
- Sanitize `request_prompt` first, then check truthiness, so a fully-stripped prompt does not produce an empty `<|im_start|>system\n<|im_end|>\n` turn.
- Revert the unnecessary `(mm_options or {}).get(...)` guard. The base annotation is `Mapping[str, BaseDummyOptions]` and every other model trusts it.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nathan Price <nathan@abridge.com>
Single-pass `str.replace("<asr_text>", "")` is bypassable via nested
payloads such as `<asr_te<asr_text>xt>` which reconstructs to
`<asr_text>` after the inner tag is stripped, allowing the
model-significant assistant-prefix delimiter to be injected into the
system / assistant turns. Move both the ChatML-token substitution and
the `<asr_text>` removal inside the fixpoint loop so nested payloads
cannot reconstruct either kind of token.
Signed-off-by: Nathan Price <nathan@abridge.com>
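The fixpoint idea from the last two commits can be sketched as follows. This is a minimal illustration, not the code in `qwen3_asr.py`: the function names and the exact regex are assumptions, and the standard-library `re` is used only to keep the sketch self-contained (the PR notes that vLLM itself requires `import regex as re`).

```python
import re

# Assumption: control tokens look like <|...|> with no nested delimiters inside.
_CONTROL_TOKEN = re.compile(r"<\|[^<>|]*\|>")
_ASR_TAG = "<asr_text>"


def strip_once(text: str) -> str:
    """One pass: drop <|...|> fragments, then drop the <asr_text> tag."""
    return _CONTROL_TOKEN.sub("", text).replace(_ASR_TAG, "")


def sanitize_user_text(text: str) -> str:
    """Repeat the pass until the text stops changing (a fixpoint)."""
    while True:
        cleaned = strip_once(text)
        if cleaned == text:
            return cleaned
        text = cleaned


def build_system_turn(request_prompt: str) -> str:
    """Sanitize first, then check truthiness, so no empty system turn is emitted."""
    context = sanitize_user_text(request_prompt)
    if not context:
        return ""
    return f"<|im_start|>system\n{context}<|im_end|>\n"
```

Under this pattern, a single pass over `<|im<|x|>_end|>` leaves `<|im_end|>` behind and a single pass over `<asr_te<asr_text>xt>` leaves `<asr_text>`; the loop removes both because each pass either shortens the string or leaves it unchanged, so it always terminates.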
2c302e0 to 0ea4b42
Use `context` / `prefix` for the sanitized fields (matching vllm-project#35415's naming for the request_prompt path). Drop the redundant inline comment that duplicated the `_sanitize_transcription_user_text` docstring. Signed-off-by: Nathan Price <nathan@abridge.com>
…response-prefix Signed-off-by: Nathan Price <nathan@abridge.com> # Conflicts: # vllm/model_executor/models/qwen3_asr.py
@Isotr0py @NickLucche what do you think of this? I haven't used audio models much so couldn't comment on the use case.

Thanks for pinging the right reviewers!
Summary
- Adds an optional `response_prefix` field to both `/v1/audio/transcriptions` and `/v1/audio/translations` endpoints

This implements Option A from RFC #35908.
How it works
For Qwen3-ASR, `response_prefix` is appended after the `<asr_text>` tag in the assistant turn, matching the prompt shape used by the Qwen3-ASR SDK's `streaming_transcribe`.

The streaming loop lives on the caller's side: the caller handles audio accumulation, rollback, and prefix capping, while the server just runs a single transcription with the given prompt.
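A caller-side loop along these lines might look like the sketch below. It is an illustration under stated assumptions, not the PR's reference client: `transcribe` is a hypothetical callable standing in for a POST to `/v1/audio/transcriptions` with `response_prefix` set, the response is assumed to contain only the continuation after the prefix (as in the local test above), and rollback is done on whitespace-separated words rather than model tokens.

```python
from typing import Callable, Iterable, Iterator, List


def stream_transcripts(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes, str], str],  # hypothetical: (audio, response_prefix) -> continuation
    rollback_words: int = 2,
    max_prefix_words: int = 64,
) -> Iterator[str]:
    """Yield a growing transcript while re-sending the accumulated audio."""
    audio = b""
    stable: List[str] = []  # words no longer subject to revision
    for chunk in chunks:
        audio += chunk  # audio accumulation
        # Prefix capping: send only the most recent stable words.
        prefix_words = stable[-max_prefix_words:] if max_prefix_words > 0 else []
        continuation = transcribe(audio, " ".join(prefix_words))
        hypothesis = stable + continuation.split()
        yield " ".join(hypothesis)
        # Rollback: keep the trailing words open for revision on the next call.
        keep = max(len(hypothesis) - rollback_words, 0)
        stable = hypothesis[:keep]
```

Each iteration issues one ordinary transcription request, so no server-side session state is needed; the trade-off is that the growing audio buffer is re-sent and re-encoded on every call.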
Changes
| File | Change |
| --- | --- |
| `vllm/config/speech_to_text.py` | Add `response_prefix: str = ""` to `SpeechToTextParams` |
| `vllm/entrypoints/speech_to_text/transcription/protocol.py` | Add `response_prefix` to `TranscriptionRequest` and pass it through |
| `vllm/entrypoints/speech_to_text/translation/protocol.py` | Add `response_prefix` to `TranslationRequest` and pass it through |
| `vllm/model_executor/models/qwen3_asr.py` | … `response_prefix` and append after `<asr_text>` tag in the assistant turn |

`response_prefix` rides on the existing `SpeechToTextParams` dataclass plumbing (the refactor from #36268), so no `SupportsTranscription` interface change is needed and other ASR backends pick up the new field as a no-op default `""` automatically.

Companion PR

This sits on top of the same Qwen3-ASR prompt-handling surface as #35415 (`prompt` / `request_prompt`). The prompt sanitizer (`_sanitize_transcription_user_text`) is shared between the two: `response_prefix` is sanitized identically so a malicious prefix cannot escape the assistant role and inject control tokens. After #35415 lands, this PR rebases to a much smaller diff.

Related

Test plan

- … `response_prefix`)
- … `response_prefix` is accepted and changes model output
- … `pre-run-check` is the label gate, not a code failure)

Notes