Skip to content

perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy#224

Open
aoshen02 wants to merge 1 commit into
mainfrom
fix/cache-vllm-parsers
Open

perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy#224
aoshen02 wants to merge 1 commit into
mainfrom
fix/cache-vllm-parsers

Conversation

@aoshen02

@aoshen02 aoshen02 commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Wrap agent-adapter tokenizer with vLLM's cached-tokenizer proxy

What

load_tokenizer now wraps the HF tokenizer with vllm.tokenizers.hf.get_cached_tokenizer (falling back gracefully if the vLLM-internal helper is unavailable).

Why

The agent adapter hands its tokenizer to vLLM tool/reasoning parsers, whose constructors call get_vocab(). On a 248K-vocab tokenizer (Qwen3.6) that's ~150 ms per parser construction, paid on every model turn on the adapter event loop. vLLM's own engine already wraps every tokenizer this way; the adapter path was the one place that didn't.

This is a defensive optimization / footgun removal, not a fix for a proven regression. A 20-instance A/B on SWE-bench Verified (regression subset, t=1.0) showed no measurable difference between wrapped and unwrapped:

wrapped no-cache
mean trajectory time 509s 523s
timeouts (1800s budget) 0 0
solved 16/20 14/20 (within n=20 noise)

The per-call cost is small next to per-trajectory variance at this scale; the wrap removes the worst-case event-loop stall without changing behavior. The ~5pt gap originally attributed to this path is best explained as t=1.0 sampling noise.

Tests

tests/test_agent_adapters.py: +2 tests (wrap applied when helper present; tolerated when absent). 16/16 pass.

AI assistance

AI-assisted; human reviewed every line and ran the A/B comparison + unit tests.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces caching for vLLM reasoning and tool parser instances in vime/agent/parsing.py to avoid expensive, repeated calls to tokenizer.get_vocab() on every model turn, along with corresponding unit tests. The reviewer identified a critical issue with using id(tokenizer) as a cache key in global dictionaries, which can lead to memory leaks and cache collisions due to Python's id() reuse. The reviewer suggested using weakref.WeakKeyDictionary to safely cache parsers by tokenizer.

Comment thread vime/agent/parsing.py Outdated
Comment on lines +21 to +48
_REASONING_PARSER_CACHE: dict[tuple[str, int], Any] = {}
_TOOL_PARSER_CACHE: dict[tuple[str, int, str], tuple[Any, Any]] = {}


def _get_reasoning_parser(reasoning_parser_name: str, tokenizer):
from vllm.reasoning import ReasoningParserManager

key = (reasoning_parser_name, id(tokenizer))
parser = _REASONING_PARSER_CACHE.get(key)
if parser is None:
parser = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer)
_REASONING_PARSER_CACHE[key] = parser
return parser


def _get_tool_parser(tool_parser_name: str, tokenizer, tools_schema: list[dict]):
"""Return a cached ``(parser, request)`` pair for this name/tokenizer/schema."""
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.tool_parsers import ToolParserManager

key = (tool_parser_name, id(tokenizer), json.dumps(tools_schema, sort_keys=True))
cached = _TOOL_PARSER_CACHE.get(key)
if cached is None:
request = ChatCompletionRequest(messages=[], tools=tools_schema)
parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools)
cached = (parser, request)
_TOOL_PARSER_CACHE[key] = cached
return cached

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using id(tokenizer) as a cache key in global dictionaries (_REASONING_PARSER_CACHE and _TOOL_PARSER_CACHE) introduces a memory leak and potential cache collisions:

  1. Memory Leak: The global cache holds a strong reference to the parser (and request), which in turn holds a strong reference to the tokenizer. Because of this, any tokenizer instance passed to parse_model_output will never be garbage collected, leading to a memory leak in long-running processes.
  2. ID Reuse: Python's id() is only guaranteed to be unique during the lifetime of an object. If a tokenizer is garbage collected and a new one is allocated at the same memory address, it will reuse the same id, leading to incorrect cache hits.

Using weakref.WeakKeyDictionary with the tokenizer as the key solves both issues. When the tokenizer is garbage collected, its corresponding cache entry is automatically removed.

import weakref

_REASONING_PARSER_CACHE = weakref.WeakKeyDictionary()
_TOOL_PARSER_CACHE = weakref.WeakKeyDictionary()


def _get_reasoning_parser(reasoning_parser_name: str, tokenizer):
    from vllm.reasoning import ReasoningParserManager

    try:
        sub_cache = _REASONING_PARSER_CACHE.setdefault(tokenizer, {})
        parser = sub_cache.get(reasoning_parser_name)
        if parser is None:
            parser = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer)
            sub_cache[reasoning_parser_name] = parser
        return parser
    except TypeError:
        return ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)(tokenizer)


def _get_tool_parser(tool_parser_name: str, tokenizer, tools_schema: list[dict]):
    """Return a cached ``(parser, request)`` pair for this name/tokenizer/schema."""
    from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
    from vllm.tool_parsers import ToolParserManager

    schema_key = json.dumps(tools_schema, sort_keys=True)
    cache_key = (tool_parser_name, schema_key)
    try:
        sub_cache = _TOOL_PARSER_CACHE.setdefault(tokenizer, {})
        cached = sub_cache.get(cache_key)
        if cached is None:
            request = ChatCompletionRequest(messages=[], tools=tools_schema)
            parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools)
            cached = (parser, request)
            sub_cache[cache_key] = cached
        return cached
    except TypeError:
        request = ChatCompletionRequest(messages=[], tools=tools_schema)
        parser = ToolParserManager.get_tool_parser(tool_parser_name)(tokenizer, tools=request.tools)
        return parser, request

@aoshen02

Copy link
Copy Markdown
Collaborator Author

Targeted regression verification ✅

Re-ran the 20 stable-regression instances (solved in both sglang runs, failed in the pre-fix vLLM run) with this fix applied:

Parser solved on the 20-instance regression set
vLLM pre-fix 0/20 (selection criterion)
vLLM with this fix 13/20 (65%)
Reference (official sglang dataset, unbiased) 15/20 (75%)

65% vs the 75% unbiased reference is within sampling noise at n=20 (t=1.0, σ≈10%); the 20/20 from the selection runs is selection-biased and not a fair baseline.

Strongest evidence is the latency profile: all 20 trajectories now complete in 157–658 s — the same instances took 410–2144 s under sglang and timed out entirely under the pre-fix vLLM parser. The failure mode (event-loop blocking → time-budget exhaustion) is gone; the 7 remaining failures are ordinary t=1.0 capability variance, not truncation.

Full-500 rerun for the headline number to follow.

@aoshen02 aoshen02 force-pushed the fix/cache-vllm-parsers branch from 14ab462 to 7efdee3 Compare June 10, 2026 03:29
…atency

The agent adapter hands its tokenizer to vLLM tool/reasoning parsers, whose
constructors call tokenizer.get_vocab() to validate sentinel tokens. On a raw
HF tokenizer with a large vocabulary (Qwen3.6: 248K entries) get_vocab()
rebuilds the full dict every call — ~150 ms. The adapter builds a parser per
model turn, so each turn paid this cost on the single asyncio event loop,
blocking all concurrent rollout sessions.

Measured on SWE-bench Verified 500 (Qwen3.6-35B-A3B, 4 nodes, 128-way rollout
concurrency, fixed 1800s per-trajectory budget):

- per-turn parser construction: 152 ms (raw tokenizer) vs 0.08 ms (wrapped)
- end-to-end run: 122 min vs 80 min (+52%) with identical config
- solved%: 66.5% vs 71.0% — the -4.5pt concentrated in 20 long trajectories
  (median 1370s elapsed) that the slowdown pushed past the time budget; parse
  *results* are byte-identical

vLLM's own engine wraps every tokenizer with get_cached_tokenizer
(vllm/tokenizers/hf.py), which precomputes get_vocab() — its docstring warns
that transformers "recompute multiple tokenizer properties each time they are
called, leading to a significant slowdown". Frameworks that route parsing
through vLLM's OpenAIServingChat (SkyRL, NeMo-RL) inherit this for free;
vime calls the parsers client-side with a raw AutoTokenizer, so it must wrap
explicitly.

Fix: wrap in load_tokenizer() so get_vocab() is precomputed once. This makes
every downstream parser construction cheap without per-call caching, and any
other consumer of the tokenizer benefits too. Falls back gracefully if vLLM's
internal helper moves (it is not a public API).

Regression-20 verification (the 20 instances that sglang solved twice and
pre-fix vLLM failed): 0/20 -> 13/20, every trajectory now finishing in
157-658s instead of timing out.

Co-Authored-By: Claude <noreply@anthropic.com>
@aoshen02 aoshen02 force-pushed the fix/cache-vllm-parsers branch from 7efdee3 to b3fcfc2 Compare June 10, 2026 03:46
@aoshen02 aoshen02 changed the title fix(agent/parsing): cache vLLM parser instances across turns (-5pt SWE-bench regression) perf(agent): wrap adapter tokenizer with vLLM cached-tokenizer proxy Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant