fix(clients): capture inline <think> reasoning in vLLM + Ollama #113

Merged
antoinezambelli merged 1 commit into main from az/vllm-reasoning-parity
Jun 20, 2026

@antoinezambelli

Owner

What

Fixes #110. When a reasoning model emits its thinking inline in content (e.g. <think>...</think>) instead of in a structured field, VLLMClient.send() silently dropped it: no REASONING message reached downstream consumers, which broke reasoning replay. This brings vLLM and Ollama to reasoning parity with the reference LlamafileClient, reusing the shared forge.prompts.think_tags.extract_think_tags helper (from #112).

Changes

  • vLLM: _resolve_reasoning reworked to a single (reasoning, content) form used identically by send() and send_stream() — chain: structured reasoning field → <think> extraction → raw content. <think> stripped from TextResponse in both paths.
  • Ollama: extract <think> tags when the structured thinking field is absent; <think> stripped from TextResponse in both paths.

Reasoning capture stays gated on self._think; TextResponse tag-stripping is unconditional (matches llamafile).
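
A minimal sketch of the resolution chain described above, for readers who want to see the precedence order concretely. This is not the code in this PR: `extract_think_tags_sketch` is a hypothetical stand-in for `forge.prompts.think_tags.extract_think_tags` (whose real signature is not shown here), `resolve_reasoning_sketch` is an illustrative name, and the final raw-content fallback assumes the content accompanies tool calls, as in #110.

```python
import re
from typing import Optional, Tuple

# Matches one complete <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_think_tags_sketch(text: str) -> Tuple[Optional[str], str]:
    """Hypothetical stand-in for the shared helper.

    Returns (thinking, text_without_think_blocks).
    """
    parts = [m.strip() for m in _THINK_RE.findall(text)]
    cleaned = _THINK_RE.sub("", text).strip()
    thinking = "\n".join(p for p in parts if p) or None
    return thinking, cleaned


def resolve_reasoning_sketch(
    structured: Optional[str], content: Optional[str], think: bool
) -> Tuple[Optional[str], str]:
    """Return (reasoning, content) using the precedence described above."""
    # Tag stripping is unconditional: it happens even when think is False.
    thinking, cleaned = extract_think_tags_sketch(content or "")
    if not think:
        return None, cleaned       # reasoning capture is gated on think
    if structured:
        return structured, cleaned  # 1. structured reasoning field wins
    if thinking:
        return thinking, cleaned    # 2. inline <think> extraction
    # 3. Assumed fallback: treat the raw content itself as reasoning.
    return (cleaned or None), cleaned
```

Because both the streaming and non-streaming paths would call the same function, the precedence cannot drift between them, which appears to be the point of the single (reasoning, content) form.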

Scope

vLLM + Ollama only. OpenAICompatClient (which currently drops reasoning on tool calls entirely), anthropic, and the proxy layer are out of scope and tracked separately.

Known limitation

Raw <think> text is still visible in live streaming TEXT_DELTA chunks before the FINAL response strips it — consistent with LlamafileClient, and the returned response object is clean. Filtering across stream-chunk boundaries is non-trivial and deferred.
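
To illustrate why per-chunk filtering is hard, here is a small self-contained sketch. The chunk boundaries are invented for the example and the regex is a simplified assumption, not the helper used by the clients. A tag split across chunks is never matched within any single chunk, while stripping the accumulated text works.

```python
import re

# Simplified pattern for one complete <think>...</think> block.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

# Hypothetical stream: both the opening and closing tags straddle chunk boundaries.
chunks = ["<thi", "nk>check units</th", "ink>The answer", " is 42."]

# Naive per-chunk stripping: no chunk contains a complete block, so nothing is removed.
per_chunk = "".join(_THINK_RE.sub("", c) for c in chunks)

# Stripping once on the accumulated text, as the FINAL response does.
final = _THINK_RE.sub("", "".join(chunks)).strip()
```

Here `per_chunk` still contains the full `<think>` block, while `final` is just the answer text. Filtering live deltas correctly would need a small buffering state machine that holds back any suffix that could be the start of a tag.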

Testing

+10 regression tests (6 non-streaming, 4 streaming) covering inline-<think> capture, structured-field precedence, text stripping, and think=False discard. test_vllm_client.py + test_ollama_client.py: 98 passed; full tests/unit suite green.

Closes #110.

🤖 Generated with Claude Code

https://claude.ai/code/session_01EpuVYCYeb1DhWfynCVyA6a

Reasoning models emit chain-of-thought either in a structured field or
inline in content (often <think>...</think>). VLLMClient.send() only read
the structured `reasoning` field, so when a model put its thinking inline
it was silently dropped -- no REASONING message downstream, breaking
reasoning replay. OllamaClient had a raw-content fallback but never
extracted <think> tags.

Bring both clients to parity with LlamafileClient using the shared
forge.prompts.think_tags.extract_think_tags helper:

- vLLM: rework _resolve_reasoning to a single (reasoning, content) form
  used identically by send() and send_stream() -- structured field ->
  <think> extraction -> raw content. Strip <think> from TextResponse in
  both paths.
- Ollama: extract <think> tags when the structured `thinking` field is
  absent; strip <think> from TextResponse in both paths.

Reasoning capture stays gated on self._think; TextResponse tag-stripping
is unconditional, matching LlamafileClient.

Adds 10 regression tests (6 non-streaming, 4 streaming).

Closes #110.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EpuVYCYeb1DhWfynCVyA6a
@antoinezambelli merged commit cc7462d into main Jun 20, 2026
2 checks passed
@antoinezambelli deleted the az/vllm-reasoning-parity branch June 20, 2026 07:34

Development

Successfully merging this pull request may close these issues.

VLLMClient.send() loses reasoning content when content field carries thinking text alongside tool_calls
