fix(clients): capture inline <think> reasoning in vLLM + Ollama #113
Merged
Reasoning models emit chain-of-thought either in a structured field or inline in content (often <think>...</think>). VLLMClient.send() only read the structured `reasoning` field, so when a model put its thinking inline it was silently dropped -- no REASONING message downstream, breaking reasoning replay. OllamaClient had a raw-content fallback but never extracted <think> tags.

Bring both clients to parity with LlamafileClient using the shared forge.prompts.think_tags.extract_think_tags helper:

- vLLM: rework _resolve_reasoning to a single (reasoning, content) form used identically by send() and send_stream() -- structured field -> <think> extraction -> raw content. Strip <think> from TextResponse in both paths.
- Ollama: extract <think> tags when the structured `thinking` field is absent; strip <think> from TextResponse in both paths.

Reasoning capture stays gated on self._think; TextResponse tag-stripping is unconditional, matching LlamafileClient.

Adds 10 regression tests (6 non-streaming, 4 streaming). Closes #110.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EpuVYCYeb1DhWfynCVyA6a
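The resolution order described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: `extract_think_tags` here is a hypothetical stand-in for `forge.prompts.think_tags.extract_think_tags` (its real signature is not shown in this description), `resolve_reasoning` is a made-up free function rather than the actual `_resolve_reasoning` method, and the final raw-content fallback is left out because the description does not say when it applies.

```python
import re
from typing import Optional, Tuple

# Assumption: tags are well-formed, non-nested <think>...</think> pairs.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_think_tags(content: str) -> Tuple[Optional[str], str]:
    """Hypothetical stand-in for the shared helper.

    Returns (reasoning, cleaned_content); reasoning is None when no
    non-empty <think> block is present.
    """
    blocks = [b.strip() for b in _THINK_RE.findall(content)]
    reasoning = "\n".join(b for b in blocks if b) or None
    cleaned = _THINK_RE.sub("", content).strip()
    return reasoning, cleaned


def resolve_reasoning(
    structured: Optional[str], content: Optional[str], think: bool
) -> Tuple[Optional[str], str]:
    """Illustrative (reasoning, content) resolution.

    Order: structured field first, then inline <think> extraction.
    Tag stripping always happens; reasoning capture is gated on `think`.
    """
    extracted, cleaned = extract_think_tags(content or "")
    if not think:
        return None, cleaned       # strip tags, discard reasoning
    if structured:
        return structured, cleaned  # structured field takes precedence
    return extracted, cleaned       # inline reasoning, or None
```

Under these assumptions, `resolve_reasoning(None, "<think>plan</think>Answer", True)` yields the inline reasoning plus a tag-free answer, while passing `think=False` keeps the clean answer and discards the reasoning.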
What
Fixes #110. Reasoning models that emit their thinking inline in `content` (e.g. `<think>...</think>`) instead of a structured field had it silently dropped by `VLLMClient.send()`: no REASONING message was produced downstream, which broke reasoning replay. This brings vLLM and Ollama to reasoning parity with the reference `LlamafileClient`, reusing the shared `forge.prompts.think_tags.extract_think_tags` helper (from #112).

Changes
- vLLM: `_resolve_reasoning` reworked to a single `(reasoning, content)` form used identically by `send()` and `send_stream()`. Chain: structured `reasoning` field → `<think>` extraction → raw content. `<think>` stripped from `TextResponse` in both paths.
- Ollama: extract `<think>` tags when the structured `thinking` field is absent; `<think>` stripped from `TextResponse` in both paths.

Reasoning capture stays gated on `self._think`; `TextResponse` tag-stripping is unconditional (matches llamafile).

Scope
vLLM + Ollama only. `OpenAICompatClient` (which currently drops reasoning on tool calls entirely), `anthropic`, and the proxy layer are out of scope and tracked separately.

Known limitation
Raw `<think>` text is still visible in live streaming `TEXT_DELTA` chunks before the `FINAL` response strips it. This is consistent with `LlamafileClient`, and the returned response object is clean. Filtering across stream-chunk boundaries is non-trivial and deferred.

Testing
+10 regression tests (6 non-streaming, 4 streaming) covering inline-`<think>` capture, structured-field precedence, text stripping, and `think=False` discard. `test_vllm_client.py` + `test_ollama_client.py`: 98 passed; full `tests/unit` suite green.

Closes #110.
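To illustrate why filtering `<think>` text out of live stream chunks is harder than stripping it from the final response, here is a small standalone sketch. It is not code from this PR: the function names and the chunk list are invented for the example, and a simple regex stands in for whatever the shared helper does.

```python
import re

# Assumption: a single well-formed <think>...</think> block.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_per_chunk(chunks):
    """Naive approach: strip tags from each streamed chunk in isolation."""
    return "".join(THINK_RE.sub("", chunk) for chunk in chunks)


def strip_final(chunks):
    """Strip tags once from the fully accumulated text."""
    return THINK_RE.sub("", "".join(chunks)).strip()


# Hypothetical stream in which the tags are split across chunk boundaries.
chunks = ["<thi", "nk>plan the ans", "wer</think>", "Hello"]

per_chunk = strip_per_chunk(chunks)  # no chunk holds a complete tag pair
final = strip_final(chunks)          # the joined text does
```

Because no single chunk contains a complete tag pair, the per-chunk pass leaves the reasoning text in place, whereas stripping the accumulated text removes it. A correct streaming filter would need to buffer partial tags across chunks, which is the deferred work.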
🤖 Generated with Claude Code
https://claude.ai/code/session_01EpuVYCYeb1DhWfynCVyA6a