perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy by aoshen02 · Pull Request #224 · vllm-project/vime

aoshen02 · 2026-06-10T00:30:08Z

Wrap agent-adapter tokenizer with vLLM's cached-tokenizer proxy

What

load_tokenizer now wraps the HF tokenizer with vllm.tokenizers.hf.get_cached_tokenizer (falling back gracefully if the vLLM-internal helper is unavailable).

Why

The agent adapter hands its tokenizer to vLLM tool/reasoning parsers, whose constructors call get_vocab(). On a 248K-vocab tokenizer (Qwen3.6) that's ~150 ms per parser construction, paid on every model turn on the adapter event loop. vLLM's own engine already wraps every tokenizer this way; the adapter path was the one place that didn't.

This is a defensive optimization / footgun removal, not a fix for a proven regression. A 20-instance A/B on SWE-bench Verified (regression subset, t=1.0) showed no measurable difference between wrapped and unwrapped:

	wrapped	no-cache
mean trajectory time	509s	523s
timeouts (1800s budget)	0	0
solved	16/20	14/20 (within n=20 noise)

The per-call cost is small next to per-trajectory variance at this scale; the wrap removes the worst-case event-loop stall without changing behavior. The ~5pt gap originally attributed to this path is best explained as t=1.0 sampling noise.

Tests

tests/test_agent_adapters.py: +2 tests (wrap applied when helper present; tolerated when absent). 16/16 pass.

AI assistance

AI-assisted; human reviewed every line and ran the A/B comparison + unit tests.

gemini-code-assist

Code Review

This pull request introduces caching for vLLM reasoning and tool parser instances in vime/agent/parsing.py to avoid expensive, repeated calls to tokenizer.get_vocab() on every model turn, along with corresponding unit tests. The reviewer identified a critical issue with using id(tokenizer) as a cache key in global dictionaries, which can lead to memory leaks and cache collisions due to Python's id() reuse. The reviewer suggested using weakref.WeakKeyDictionary to safely cache parsers by tokenizer.

gemini-code-assist · 2026-06-10T00:31:00Z

+_REASONING_PARSER_CACHE: dict[tuple[str, int], Any] = {}
+_TOOL_PARSER_CACHE: dict[tuple[str, int, str], tuple[Any, Any]] = {}
+
+
+def _get_reasoning_parser(reasoning_parser_name: str, tokenizer):
+    from vllm.reasoning import ReasoningParserManager
+
+    key = (reasoning_parser_name, id(tokenizer))
+    parser = _REASONING_PARSER_CACHE.get(key)
+    if parser is None:
+        parser = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer)
+        _REASONING_PARSER_CACHE[key] = parser
+    return parser
+
+
+def _get_tool_parser(tool_parser_name: str, tokenizer, tools_schema: list[dict]):
+    """Return a cached ``(parser, request)`` pair for this name/tokenizer/schema."""
+    from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
+    from vllm.tool_parsers import ToolParserManager
+
+    key = (tool_parser_name, id(tokenizer), json.dumps(tools_schema, sort_keys=True))
+    cached = _TOOL_PARSER_CACHE.get(key)
+    if cached is None:
+        request = ChatCompletionRequest(messages=[], tools=tools_schema)
+        parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools)
+        cached = (parser, request)
+        _TOOL_PARSER_CACHE[key] = cached
+    return cached


Using id(tokenizer) as a cache key in global dictionaries (_REASONING_PARSER_CACHE and _TOOL_PARSER_CACHE) introduces a memory leak and potential cache collisions:

Memory Leak: The global cache holds a strong reference to the parser (and request), which in turn holds a strong reference to the tokenizer. Because of this, any tokenizer instance passed to parse_model_output will never be garbage collected, leading to a memory leak in long-running processes.

ID Reuse: Python's id() is only guaranteed to be unique during the lifetime of an object. If a tokenizer is garbage collected and a new one is allocated at the same memory address, it will reuse the same id, leading to incorrect cache hits.

Using weakref.WeakKeyDictionary with the tokenizer as the key solves both issues. When the tokenizer is garbage collected, its corresponding cache entry is automatically removed.

import weakref _REASONING_PARSER_CACHE = weakref.WeakKeyDictionary() _TOOL_PARSER_CACHE = weakref.WeakKeyDictionary() def _get_reasoning_parser(reasoning_parser_name: str, tokenizer): from vllm.reasoning import ReasoningParserManager try: sub_cache = _REASONING_PARSER_CACHE.setdefault(tokenizer, {}) parser = sub_cache.get(reasoning_parser_name) if parser is None: parser = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer) sub_cache[reasoning_parser_name] = parser return parser except TypeError: return ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer) def _get_tool_parser(tool_parser_name: str, tokenizer, tools_schema: list[dict]): """Return a cached ``(parser, request)`` pair for this name/tokenizer/schema.""" from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest from vllm.tool_parsers import ToolParserManager schema_key = json.dumps(tools_schema, sort_keys=True) cache_key = (tool_parser_name, schema_key) try: sub_cache = _TOOL_PARSER_CACHE.setdefault(tokenizer, {}) cached = sub_cache.get(cache_key) if cached is None: request = ChatCompletionRequest(messages=[], tools=tools_schema) parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools) cached = (parser, request) sub_cache[cache_key] = cached return cached except TypeError: request = ChatCompletionRequest(messages=[], tools=tools_schema) parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools) return parser, request

aoshen02 · 2026-06-10T03:22:32Z

Targeted regression verification ✅

Re-ran the 20 stable-regression instances (solved in both sglang runs, failed in the pre-fix vLLM run) with this fix applied:

Parser	solved on the 20-instance regression set
vLLM pre-fix	0/20 (selection criterion)
vLLM with this fix	13/20 (65%)
Reference (official sglang dataset, unbiased)	15/20 (75%)

65% vs the 75% unbiased reference is within sampling noise at n=20 (t=1.0, σ≈10%); the 20/20 from the selection runs is selection-biased and not a fair baseline.

Strongest evidence is the latency profile: all 20 trajectories now complete in 157–658 s — the same instances took 410–2144 s under sglang and timed out entirely under the pre-fix vLLM parser. The failure mode (event-loop blocking → time-budget exhaustion) is gone; the 7 remaining failures are ordinary t=1.0 capability variance, not truncation.

Full-500 rerun for the headline number to follow.

…atency The agent adapter hands its tokenizer to vLLM tool/reasoning parsers, whose constructors call tokenizer.get_vocab() to validate sentinel tokens. On a raw HF tokenizer with a large vocabulary (Qwen3.6: 248K entries) get_vocab() rebuilds the full dict every call — ~150 ms. The adapter builds a parser per model turn, so each turn paid this cost on the single asyncio event loop, blocking all concurrent rollout sessions. Measured on SWE-bench Verified 500 (Qwen3.6-35B-A3B, 4 nodes, 128-way rollout concurrency, fixed 1800s per-trajectory budget): - per-turn parser construction: 152 ms (raw tokenizer) vs 0.08 ms (wrapped) - end-to-end run: 122 min vs 80 min (+52%) with identical config - solved%: 66.5% vs 71.0% — the -4.5pt concentrated in 20 long trajectories (median 1370s elapsed) that the slowdown pushed past the time budget; parse *results* are byte-identical vLLM's own engine wraps every tokenizer with get_cached_tokenizer (vllm/tokenizers/hf.py), which precomputes get_vocab() — its docstring warns that transformers "recompute multiple tokenizer properties each time they are called, leading to a significant slowdown". Frameworks that route parsing through vLLM's OpenAIServingChat (SkyRL, NeMo-RL) inherit this for free; vime calls the parsers client-side with a raw AutoTokenizer, so it must wrap explicitly. Fix: wrap in load_tokenizer() so get_vocab() is precomputed once. This makes every downstream parser construction cheap without per-call caching, and any other consumer of the tokenizer benefits too. Falls back gracefully if vLLM's internal helper moves (it is not a public API). Regression-20 verification (the 20 instances that sglang solved twice and pre-fix vLLM failed): 0/20 -> 13/20, every trajectory now finishing in 157-658s instead of timing out. Co-Authored-By: Claude <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 10, 2026

View reviewed changes

aoshen02 force-pushed the fix/cache-vllm-parsers branch from 14ab462 to 7efdee3 Compare June 10, 2026 03:29

aoshen02 force-pushed the fix/cache-vllm-parsers branch from 7efdee3 to b3fcfc2 Compare June 10, 2026 03:46

aoshen02 changed the title ~~fix(agent/parsing): cache vLLM parser instances across turns (-5pt SWE-bench regression)~~ perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy#224

perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy#224
aoshen02 wants to merge 1 commit into
mainfrom
fix/cache-vllm-parsers

aoshen02 commented Jun 10, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Uh oh!

aoshen02 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

aoshen02 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Wrap agent-adapter tokenizer with vLLM's cached-tokenizer proxy

What

Why

Tests

AI assistance

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

aoshen02 commented Jun 10, 2026

Targeted regression verification ✅

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aoshen02 commented Jun 10, 2026 •

edited

Loading