perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy#224
perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy#224aoshen02 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces caching for vLLM reasoning and tool parser instances in vime/agent/parsing.py to avoid expensive, repeated calls to tokenizer.get_vocab() on every model turn, along with corresponding unit tests. The reviewer identified a critical issue with using id(tokenizer) as a cache key in global dictionaries, which can lead to memory leaks and cache collisions due to Python's id() reuse. The reviewer suggested using weakref.WeakKeyDictionary to safely cache parsers by tokenizer.
| _REASONING_PARSER_CACHE: dict[tuple[str, int], Any] = {} | ||
| _TOOL_PARSER_CACHE: dict[tuple[str, int, str], tuple[Any, Any]] = {} | ||
|
|
||
|
|
||
| def _get_reasoning_parser(reasoning_parser_name: str, tokenizer): | ||
| from vllm.reasoning import ReasoningParserManager | ||
|
|
||
| key = (reasoning_parser_name, id(tokenizer)) | ||
| parser = _REASONING_PARSER_CACHE.get(key) | ||
| if parser is None: | ||
| parser = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer) | ||
| _REASONING_PARSER_CACHE[key] = parser | ||
| return parser | ||
|
|
||
|
|
||
| def _get_tool_parser(tool_parser_name: str, tokenizer, tools_schema: list[dict]): | ||
| """Return a cached ``(parser, request)`` pair for this name/tokenizer/schema.""" | ||
| from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest | ||
| from vllm.tool_parsers import ToolParserManager | ||
|
|
||
| key = (tool_parser_name, id(tokenizer), json.dumps(tools_schema, sort_keys=True)) | ||
| cached = _TOOL_PARSER_CACHE.get(key) | ||
| if cached is None: | ||
| request = ChatCompletionRequest(messages=[], tools=tools_schema) | ||
| parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools) | ||
| cached = (parser, request) | ||
| _TOOL_PARSER_CACHE[key] = cached | ||
| return cached |
There was a problem hiding this comment.
Using id(tokenizer) as a cache key in global dictionaries (_REASONING_PARSER_CACHE and _TOOL_PARSER_CACHE) introduces a memory leak and potential cache collisions:
- Memory Leak: The global cache holds a strong reference to the
parser(andrequest), which in turn holds a strong reference to thetokenizer. Because of this, anytokenizerinstance passed toparse_model_outputwill never be garbage collected, leading to a memory leak in long-running processes. - ID Reuse: Python's
id()is only guaranteed to be unique during the lifetime of an object. If a tokenizer is garbage collected and a new one is allocated at the same memory address, it will reuse the sameid, leading to incorrect cache hits.
Using weakref.WeakKeyDictionary with the tokenizer as the key solves both issues. When the tokenizer is garbage collected, its corresponding cache entry is automatically removed.
import weakref
_REASONING_PARSER_CACHE = weakref.WeakKeyDictionary()
_TOOL_PARSER_CACHE = weakref.WeakKeyDictionary()
def _get_reasoning_parser(reasoning_parser_name: str, tokenizer):
from vllm.reasoning import ReasoningParserManager
try:
sub_cache = _REASONING_PARSER_CACHE.setdefault(tokenizer, {})
parser = sub_cache.get(reasoning_parser_name)
if parser is None:
parser = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer)
sub_cache[reasoning_parser_name] = parser
return parser
except TypeError:
return ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer)
def _get_tool_parser(tool_parser_name: str, tokenizer, tools_schema: list[dict]):
"""Return a cached ``(parser, request)`` pair for this name/tokenizer/schema."""
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.tool_parsers import ToolParserManager
schema_key = json.dumps(tools_schema, sort_keys=True)
cache_key = (tool_parser_name, schema_key)
try:
sub_cache = _TOOL_PARSER_CACHE.setdefault(tokenizer, {})
cached = sub_cache.get(cache_key)
if cached is None:
request = ChatCompletionRequest(messages=[], tools=tools_schema)
parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools)
cached = (parser, request)
sub_cache[cache_key] = cached
return cached
except TypeError:
request = ChatCompletionRequest(messages=[], tools=tools_schema)
parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools)
return parser, request
Targeted regression verification ✅Re-ran the 20 stable-regression instances (solved in both sglang runs, failed in the pre-fix vLLM run) with this fix applied:
65% vs the 75% unbiased reference is within sampling noise at n=20 (t=1.0, σ≈10%); the 20/20 from the selection runs is selection-biased and not a fair baseline. Strongest evidence is the latency profile: all 20 trajectories now complete in 157–658 s — the same instances took 410–2144 s under sglang and timed out entirely under the pre-fix vLLM parser. The failure mode (event-loop blocking → time-budget exhaustion) is gone; the 7 remaining failures are ordinary t=1.0 capability variance, not truncation. Full-500 rerun for the headline number to follow. |
14ab462 to
7efdee3
Compare
…atency The agent adapter hands its tokenizer to vLLM tool/reasoning parsers, whose constructors call tokenizer.get_vocab() to validate sentinel tokens. On a raw HF tokenizer with a large vocabulary (Qwen3.6: 248K entries) get_vocab() rebuilds the full dict every call — ~150 ms. The adapter builds a parser per model turn, so each turn paid this cost on the single asyncio event loop, blocking all concurrent rollout sessions. Measured on SWE-bench Verified 500 (Qwen3.6-35B-A3B, 4 nodes, 128-way rollout concurrency, fixed 1800s per-trajectory budget): - per-turn parser construction: 152 ms (raw tokenizer) vs 0.08 ms (wrapped) - end-to-end run: 122 min vs 80 min (+52%) with identical config - solved%: 66.5% vs 71.0% — the -4.5pt concentrated in 20 long trajectories (median 1370s elapsed) that the slowdown pushed past the time budget; parse *results* are byte-identical vLLM's own engine wraps every tokenizer with get_cached_tokenizer (vllm/tokenizers/hf.py), which precomputes get_vocab() — its docstring warns that transformers "recompute multiple tokenizer properties each time they are called, leading to a significant slowdown". Frameworks that route parsing through vLLM's OpenAIServingChat (SkyRL, NeMo-RL) inherit this for free; vime calls the parsers client-side with a raw AutoTokenizer, so it must wrap explicitly. Fix: wrap in load_tokenizer() so get_vocab() is precomputed once. This makes every downstream parser construction cheap without per-call caching, and any other consumer of the tokenizer benefits too. Falls back gracefully if vLLM's internal helper moves (it is not a public API). Regression-20 verification (the 20 instances that sglang solved twice and pre-fix vLLM failed): 0/20 -> 13/20, every trajectory now finishing in 157-658s instead of timing out. Co-Authored-By: Claude <noreply@anthropic.com>
7efdee3 to
b3fcfc2
Compare
Wrap agent-adapter tokenizer with vLLM's cached-tokenizer proxy
What
load_tokenizernow wraps the HF tokenizer withvllm.tokenizers.hf.get_cached_tokenizer(falling back gracefully if the vLLM-internal helper is unavailable).Why
The agent adapter hands its tokenizer to vLLM tool/reasoning parsers, whose constructors call
get_vocab(). On a 248K-vocab tokenizer (Qwen3.6) that's ~150 ms per parser construction, paid on every model turn on the adapter event loop. vLLM's own engine already wraps every tokenizer this way; the adapter path was the one place that didn't.This is a defensive optimization / footgun removal, not a fix for a proven regression. A 20-instance A/B on SWE-bench Verified (regression subset, t=1.0) showed no measurable difference between wrapped and unwrapped:
The per-call cost is small next to per-trajectory variance at this scale; the wrap removes the worst-case event-loop stall without changing behavior. The ~5pt gap originally attributed to this path is best explained as t=1.0 sampling noise.
Tests
tests/test_agent_adapters.py: +2 tests (wrap applied when helper present; tolerated when absent). 16/16 pass.AI assistance
AI-assisted; human reviewed every line and ran the A/B comparison + unit tests.