fix(clients): capture inline <think> reasoning in vLLM + Ollama by antoinezambelli · Pull Request #113 · antoinezambelli/forge

antoinezambelli · 2026-06-20T07:32:30Z

What

Fixes #110. Reasoning models that emit their thinking inline in content (e.g. <think>...</think>) instead of a structured field had it silently dropped by VLLMClient.send() — no REASONING message downstream, breaking reasoning replay. This brings vLLM and Ollama to reasoning-parity with the reference LlamafileClient, reusing the shared forge.prompts.think_tags.extract_think_tags helper (from #112).

Changes

- vLLM: _resolve_reasoning reworked to a single (reasoning, content) form used identically by send() and send_stream(). Resolution chain: structured reasoning field → <think> extraction → raw content. <think> stripped from TextResponse in both paths.
- Ollama: extract <think> tags when the structured thinking field is absent; <think> stripped from TextResponse in both paths.

Reasoning capture stays gated on self._think; TextResponse tag-stripping is unconditional (matches llamafile).
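
The resolution chain and the gating rule above can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: `extract_think_tags` here is a hypothetical regex-based substitute for the real forge.prompts.think_tags.extract_think_tags helper (whose signature may differ), `resolve_reasoning` is a free function rather than the actual `_resolve_reasoning` method, and the exact condition under which the raw-content fallback applies is an assumption.

```python
import re
from typing import Optional, Tuple

# Hypothetical stand-in for forge.prompts.think_tags.extract_think_tags;
# the real helper's signature and behaviour may differ.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_think_tags(content: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, content with <think> blocks removed)."""
    blocks = [m.strip() for m in _THINK_RE.findall(content)]
    stripped = _THINK_RE.sub("", content).strip()
    reasoning = "\n".join(b for b in blocks if b) or None
    return reasoning, stripped


def resolve_reasoning(
    structured: Optional[str], content: str, think: bool
) -> Tuple[Optional[str], str]:
    """Sketch of the chain: structured field -> <think> extraction -> raw content."""
    inline, clean = extract_think_tags(content)
    if not think:
        # Reasoning capture is gated on think; tag stripping is unconditional.
        return None, clean
    if structured:
        # A structured reasoning field takes precedence over inline tags.
        return structured, clean
    if inline:
        return inline, clean
    # Assumed last resort: treat the raw content itself as the reasoning.
    return (content or None), clean
```

Returning a single (reasoning, content) pair lets the streaming and non-streaming paths share one code path, which is the property the Changes list describes.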

Scope

vLLM + Ollama only. OpenAICompatClient (which currently drops reasoning on tool calls entirely), anthropic, and the proxy layer are out of scope and tracked separately.

Known limitation

Raw <think> text is still visible in live streaming TEXT_DELTA chunks before the FINAL response strips it — consistent with LlamafileClient, and the returned response object is clean. Filtering across stream-chunk boundaries is non-trivial and deferred.
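
A small illustration of why chunk-level filtering is harder than stripping the final text. The chunk contents and the `strip_think` helper are hypothetical, and real TEXT_DELTA boundaries depend on the backend. When a tag is split across chunks, a per-chunk regex never sees a complete pair.

```python
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove complete <think>...</think> blocks from a string."""
    return _THINK_RE.sub("", text)


# Hypothetical stream in which the tags are split across chunk boundaries.
chunks = ["<thi", "nk>plan the ans", "wer</th", "ink>Hello"]

# Per-chunk filtering never sees a complete tag pair, so nothing is removed.
per_chunk = "".join(strip_think(c) for c in chunks)

# Stripping the accumulated text, as the FINAL response does, works.
final = strip_think("".join(chunks))
```

Filtering live deltas correctly would need a stateful buffer that holds back any partial tag prefix, which is the deferred work.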

Testing

+10 regression tests (6 non-streaming, 4 streaming) covering inline-<think> capture, structured-field precedence, text stripping, and think=False discard. test_vllm_client.py + test_ollama_client.py: 98 passed; full tests/unit suite green.

Closes #110.

🤖 Generated with Claude Code

https://claude.ai/code/session_01EpuVYCYeb1DhWfynCVyA6a

Reasoning models emit chain-of-thought either in a structured field or inline in content (often <think>...</think>). VLLMClient.send() only read the structured `reasoning` field, so when a model put its thinking inline it was silently dropped -- no REASONING message downstream, breaking reasoning replay. OllamaClient had a raw-content fallback but never extracted <think> tags.

Bring both clients to parity with LlamafileClient using the shared forge.prompts.think_tags.extract_think_tags helper:

- vLLM: rework _resolve_reasoning to a single (reasoning, content) form used identically by send() and send_stream() -- structured field -> <think> extraction -> raw content. Strip <think> from TextResponse in both paths.
- Ollama: extract <think> tags when the structured `thinking` field is absent; strip <think> from TextResponse in both paths.

Reasoning capture stays gated on self._think; TextResponse tag-stripping is unconditional, matching LlamafileClient.

Adds 10 regression tests (6 non-streaming, 4 streaming).

Closes #110.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EpuVYCYeb1DhWfynCVyA6a

antoinezambelli mentioned this pull request Jun 20, 2026

OpenAICompatClient drops reasoning content on tool calls #114

Open

antoinezambelli merged commit cc7462d into main Jun 20, 2026
2 checks passed

antoinezambelli deleted the az/vllm-reasoning-parity branch June 20, 2026 07:34

antoinezambelli merged 1 commit into main from az/vllm-reasoning-parity
