feat: --hybrid flag routes file-analyzer batches to local Gemma via Ollama (~4-5× cost reduction) #176
… via Ollama

Adds hybrid_runner.py and a new --hybrid flag to the /understand skill, routing the expensive file-analyzer extraction phase to a local Gemma model (default: gemma4:26b-a4b) via Ollama instead of Claude subagents. Reduces Claude API cost by ~4-5× on medium/large repos by keeping architectural reasoning on Claude and offloading structural extraction to a local model.

Changes:
- skills/understand/hybrid_runner.py (new): standalone Python script that runs extract-structure.mjs (unchanged), sends batch data to Ollama, and applies post-processing (import edge injection from batchImportData, weight normalization, deduplication). Graceful fallback on Gemma failure.
- skills/understand/SKILL.md: adds --hybrid flag docs, Phase 0 Ollama reachability check (HYBRID_MODE), Phase 2 hybrid dispatch path (5 parallel background python processes writing batch-N.json)

Design notes:
- project-scanner and assemble-reviewer stay on Claude (low cost / need judgment)
- Only `imports` edges are stripped and reinjected from batchImportData to prevent Gemma hallucinating import relationships (confirmed via calibration probe)
- Configure via OLLAMA_HOST, OLLAMA_MODEL, HYBRID_TIMEOUT env vars
- Tested on gemma4:26b-a4b; compatible with any Ollama-hosted model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
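For readers following the post-processing step, here is a minimal sketch of stripping model-emitted `imports` edges, reinjecting them from pre-resolved data, and deduplicating. It is not the code in hybrid_runner.py: the function name `reinject_import_edges`, the edge shape (`source`, `target`, `type` keys), and the shape of `batch_import_data` (a mapping from a file to the files it imports) are all assumptions made for illustration.

```python
from typing import Any

Edge = dict[str, Any]


def reinject_import_edges(
    model_edges: list[Edge],
    batch_import_data: dict[str, list[str]],
) -> list[Edge]:
    """Drop model-emitted import edges and rebuild them from pre-resolved data.

    Hypothetical helper: the edge keys and the batch_import_data layout are
    assumptions, not the schema used by the real runner.
    """
    # Keep every non-import edge the model produced.
    kept = [e for e in model_edges if e.get("type") != "imports"]

    # Rebuild import edges deterministically (sorted sources, given target order).
    for source in sorted(batch_import_data):
        for target in batch_import_data[source]:
            kept.append({"source": source, "target": target, "type": "imports"})

    # Deduplicate on (source, target, type), keeping the first occurrence.
    seen: set[tuple[Any, Any, Any]] = set()
    unique: list[Edge] = []
    for edge in kept:
        key = (edge.get("source"), edge.get("target"), edge.get("type"))
        if key not in seen:
            seen.add(key)
            unique.append(edge)
    return unique
```

Because the import edges come only from the pre-resolved mapping, a hallucinated edge from the model cannot survive this step, whatever the model emitted.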
Sounds good. Can this run faster, or does it just reduce token usage in your use case? @codex review this
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97a75401f3
except urllib.error.HTTPError as e:
    body = e.read().decode()[:500]
    raise RuntimeError(f"Ollama HTTP {e.code}: {body}") from e
except TimeoutError:
    raise RuntimeError(f"Ollama request timed out after {timeout}s") from None
Catch URLError and preserve fallback behavior
call_ollama only converts HTTPError/TimeoutError to RuntimeError, but urllib.request.urlopen raises URLError for common failures like connection refused, DNS failures, and many timeout cases. In those cases run_analyze's except RuntimeError block is skipped, so the process exits with an uncaught exception instead of writing the documented empty batch-<N>.json fallback, which can break hybrid runs when Ollama becomes unavailable mid-analysis.
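One possible shape for the fix, as a sketch only: the signature of `call_ollama` below (URL, payload dict, timeout) and the JSON request body are assumptions, not the PR's actual code. What it illustrates is the ordering of the handlers: `HTTPError` is a subclass of `URLError`, so it has to be caught first to keep its status code and body.

```python
import json
import urllib.error
import urllib.request


def call_ollama(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON payload and surface every transport failure as RuntimeError.

    Assumed signature for illustration; the real function may differ.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as e:
        # HTTPError subclasses URLError, so it must be handled before it.
        body = e.read().decode(errors="replace")[:500]
        raise RuntimeError(f"Ollama HTTP {e.code}: {body}") from e
    except urllib.error.URLError as e:
        # Connection refused, DNS failure, or a timeout wrapped by urllib.
        raise RuntimeError(f"Ollama unreachable: {e.reason}") from e
    except TimeoutError:
        # Timeouts raised while reading the response are not wrapped in URLError.
        raise RuntimeError(f"Ollama request timed out after {timeout}s") from None
```

With this in place, a caller that only handles `RuntimeError` sees connection-level failures too, so it can reach its fallback path.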
n_type = node.get("type", "file")
if n_type not in VALID_NODE_TYPES:
    node["type"] = "file"  # fallback
Populate missing node type before emitting batch output
The normalization path only fixes type when it is invalid, not when it is missing: node.get("type", "file") makes the check pass while leaving node["type"] unset. If Gemma omits type for a node, that node is emitted without a required schema field and will be dropped later during graph validation, causing avoidable node loss in hybrid mode.
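A sketch of the unconditional write-back this comment asks for. The contents of `VALID_NODE_TYPES` here are placeholders (the real set is defined in the runner and is not shown in this thread), and `normalize_node_type` is a hypothetical helper name.

```python
# Placeholder values: the real VALID_NODE_TYPES lives in hybrid_runner.py.
VALID_NODE_TYPES = {"file", "function", "class"}


def normalize_node_type(node: dict) -> dict:
    """Ensure node["type"] is always present and valid, defaulting to "file"."""
    n_type = node.get("type", "file")
    # Treat non-string or unknown values the same way as a missing key.
    if not isinstance(n_type, str) or n_type not in VALID_NODE_TYPES:
        n_type = "file"
    # Write back unconditionally so a missing key is populated too.
    node["type"] = n_type
    return node
```

The difference from the reviewed snippet is only the final assignment: resolving the value first and then always storing it covers the missing-key case as well as the invalid-value case.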
If I increase the concurrent batches from 5 to 10, will that speed things up?
…ng node type)
P1: call_ollama only caught HTTPError/TimeoutError, so urllib.error.URLError
(connection refused when Ollama is down, DNS failures, socket-level timeouts)
escaped run_analyze's `except RuntimeError` and crashed the hybrid run instead
of writing the documented empty batch-N.json fallback. Catch URLError and wrap
it as RuntimeError so graceful fallback works as advertised.
P2: node `type` normalization used `node.get("type", "file")` to test validity
but only wrote back on an invalid value — a node with a *missing* type key
passed the check while node["type"] stayed unset, causing it to be dropped at
schema validation (silent node loss). Write the resolved type back
unconditionally.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Lum1104 Good question: it's primarily a cost play, not a speed one. The ~4-5× is token/API cost, not wall-clock. On most setups it's actually a touch slower: the 5 file-analyzer batches all hit a single local Ollama instance, so a large model on one GPU serializes generation, whereas standard mode runs those batches with Claude's cloud parallelism. So: meaningfully cheaper, comparable-to-slightly-slower on speed depending on your local hardware. If latency (not cost) is the goal, standard mode is still the faster path. Also worth noting the fix I just pushed (…
@simkimsia In hybrid mode, no, and it can make things slightly worse. All batches queue against a single Ollama endpoint, so throughput is bound by the GPU and …

In standard (Claude) mode, raising concurrency does help, up to your API rate limits, since each batch is an independent cloud request.
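A back-of-envelope model of that answer, under two simplifying assumptions that are not measurements from this PR: the local endpoint generates one batch at a time, and cloud requests are fully independent. The function name and the numbers in the usage note are hypothetical.

```python
import math


def estimated_wall_clock(
    n_batches: int,
    seconds_per_batch: float,
    concurrency: int,
    serialized_backend: bool,
) -> float:
    """Rough wall-clock estimate for dispatching n_batches of equal cost."""
    if serialized_backend:
        # Requests queue behind one another on a single endpoint, so extra
        # client-side concurrency does not shorten the total.
        return n_batches * seconds_per_batch
    # Independent requests complete in "waves" of size `concurrency`.
    waves = math.ceil(n_batches / concurrency)
    return waves * seconds_per_batch
```

For example, with 10 batches at a hypothetical 60 seconds each, the serialized estimate stays at 600 seconds whether concurrency is 5 or 10, while the independent-request estimate drops from 120 to 60 seconds. The model ignores contention overhead, which is why real hybrid runs can get slightly worse rather than merely stay flat.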
Resolve SKILL.md conflicts in Phase 1 / Phase 2 sections:
- Keep both upstream's `[Phase 1/7] Scanning...` progress line and the PR's hybrid-mode Phase 1 note.
- Keep the PR's HYBRID_MODE batch-dispatch block; adopt upstream's reworded file-analyzer dispatch line and prompt-template intro (now references batches.json[i]).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary

- `--hybrid` flag to `/understand` skill that routes the expensive file-analyzer extraction phase to a local Gemma model via Ollama instead of Claude subagents
- `hybrid_runner.py`: a self-contained Python orchestrator that handles tree-sitter extraction + Gemma semantic analysis + schema-safe post-processing

How it works

Routing table:

Key design decisions

Why only file-analyzer? It's ~80% of total token cost and runs in 5 concurrent batches of 20-30 files each. The other agents are either low-cost or require Claude-level reasoning quality.

Import edge injection: Gemma tends to re-resolve imports from source instead of strictly using `batchImportData`. `hybrid_runner.py` strips all Gemma-emitted `imports` edges and reinserts them deterministically from `batchImportData` (pre-resolved by project-scanner). This makes import graphs identical to standard mode regardless of model quality.

Graceful fallback: If Gemma fails (timeout, parse error), the runner writes an empty `batch-N.json`. The pipeline continues; assemble-reviewer notes the gap and merge-batch-graphs.py handles it.

Safe for existing users: `HYBRID_MODE=false` by default. No behavior change unless `--hybrid` is explicitly passed.

Requirements

- … `gemma4:26b-a4b`, any model ≥13B should work)
- `extract-structure.mjs` script already bundled with the skill runs unchanged

Tested on

- `gemma4:26b-a4b` via Ollama
- `merge-batch-graphs.py` normalization

Test plan

- `python3 hybrid_runner.py --help` shows scan/analyze subcommands
- `/understand --hybrid` on a small repo (20-50 files): verify `batch-*.json` written
- `OLLAMA_MODEL` env var respected

🤖 Generated with Claude Code
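The graceful fallback listed under the key design decisions can be sketched roughly as follows. This is an illustration under assumptions, not the runner's code: the helper name `run_batch_with_fallback`, the `analyze` callable, and in particular the empty-batch shape (`nodes` and `edges` lists) are hypothetical.

```python
import json
import sys
from pathlib import Path
from typing import Callable

# Assumed shape of an "empty" batch; the real schema may differ.
EMPTY_BATCH = {"nodes": [], "edges": []}


def run_batch_with_fallback(
    batch_index: int,
    out_dir: Path,
    analyze: Callable[[], dict],
) -> bool:
    """Write batch-<N>.json, falling back to an empty batch on model failure.

    Returns True when the analysis succeeded, False when the fallback was used.
    """
    out_path = out_dir / f"batch-{batch_index}.json"
    try:
        result = analyze()
        ok = True
    except (RuntimeError, json.JSONDecodeError) as exc:
        # Timeouts and transport errors arrive as RuntimeError; malformed
        # model output arrives as JSONDecodeError.
        print(f"batch {batch_index}: writing empty fallback ({exc})", file=sys.stderr)
        result = EMPTY_BATCH
        ok = False
    # The file is written on both paths so later stages always find it.
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return ok
```

Writing the file on both paths is the point of the design: downstream merging never has to distinguish a failed batch from a missing one.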