feat: --hybrid flag routes file-analyzer batches to local Gemma via Ollama (~4-5× cost reduction)#176

Open
Itsthewayofyou wants to merge 3 commits into
Egonex-AI:mainfrom
Itsthewayofyou:feat/hybrid-gemma-routing

@Itsthewayofyou


Summary

  • Adds --hybrid flag to /understand skill that routes the expensive file-analyzer extraction phase to a local Gemma model via Ollama instead of Claude subagents
  • Introduces hybrid_runner.py — a self-contained Python orchestrator that handles tree-sitter extraction + Gemma semantic analysis + schema-safe post-processing
  • Reduces Claude API cost by ~4-5× on medium/large repos. Architecture-reasoning phases (architecture-analyzer, tour-builder, domain-analyzer) stay on Claude.

How it works

# Standard (unchanged)
/understand

# Hybrid — file-analyzer batches go to local Gemma
/understand --hybrid

# Configure the model/endpoint
OLLAMA_MODEL=gemma4:26b-a4b OLLAMA_HOST=http://localhost:11434 /understand --hybrid
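The environment variables above could be read along the following lines. This is a minimal sketch, not the code in hybrid_runner.py: `HybridConfig` and `load_config` are hypothetical names, and the 600-second timeout default is an assumption (the PR names `HYBRID_TIMEOUT` but does not state its default).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:
    """Hypothetical container for hybrid-mode settings."""
    host: str
    model: str
    timeout: float


def load_config(env=None):
    """Read hybrid-mode settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return HybridConfig(
        # Strip a trailing slash so later URL joins stay predictable.
        host=env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/"),
        model=env.get("OLLAMA_MODEL", "gemma4:26b-a4b"),
        # Assumed default; the real script may use a different value.
        timeout=float(env.get("HYBRID_TIMEOUT", "600")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in isolation.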

Routing table:

| Phase | Standard | Hybrid |
|---|---|---|
| project-scanner | Claude | Claude (low cost, dynamic script) |
| file-analyzer ×N batches | Claude | Gemma via Ollama |
| assemble-reviewer | Claude | Claude (semantic judgment) |
| architecture-analyzer | Claude | Claude |
| tour-builder | Claude | Claude |
| domain-analyzer | Claude | Claude |
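Read as data, the routing table reduces to a single override: only file-analyzer changes backend when hybrid mode is on. The sketch below is illustrative only; `PHASES`, `HYBRID_OVERRIDES`, and `backend_for` are hypothetical names, and the actual dispatch lives in SKILL.md rather than in a Python table.

```python
# Phases from the routing table, in pipeline order.
PHASES = (
    "project-scanner",
    "file-analyzer",
    "assemble-reviewer",
    "architecture-analyzer",
    "tour-builder",
    "domain-analyzer",
)

# Hypothetical override map: phases that leave Claude in hybrid mode.
HYBRID_OVERRIDES = {"file-analyzer": "ollama"}


def backend_for(phase, hybrid):
    """Return which backend handles a phase ("claude" or "ollama")."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    if hybrid:
        return HYBRID_OVERRIDES.get(phase, "claude")
    return "claude"
```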

Key design decisions

Why only file-analyzer? It's ~80% of total token cost — runs in 5 concurrent batches of 20-30 files each. The other agents are either low-cost or require Claude-level reasoning quality.

Import edge injection: Gemma tends to re-resolve imports from source instead of strictly using batchImportData. hybrid_runner.py strips all Gemma-emitted imports edges and reinserts them deterministically from batchImportData (pre-resolved by project-scanner). This makes import graphs identical to standard mode regardless of model quality.
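The strip-and-reinsert step can be sketched as follows. Several details here are assumptions made for illustration: edges are taken to be dicts with `source`, `target`, and `type` keys, `batchImportData` is taken to map each file path to a list of resolved import paths, and node IDs are assumed to use a `file:` prefix. The function name `inject_import_edges` is hypothetical.

```python
def inject_import_edges(edges, batch_import_data):
    """Replace model-emitted imports edges with deterministic ones.

    edges: list of edge dicts produced by the model (assumed shape).
    batch_import_data: {source_path: [resolved_target_path, ...]} (assumed shape).
    """
    # Keep every edge the model produced except its "imports" edges.
    result = [edge for edge in edges if edge.get("type") != "imports"]

    seen = set()
    # Sorted iteration makes the output order independent of dict ordering.
    for source_path in sorted(batch_import_data):
        for target_path in batch_import_data[source_path]:
            key = (source_path, target_path)
            if key in seen:
                continue  # skip duplicate import pairs
            seen.add(key)
            result.append({
                "source": f"file:{source_path}",  # assumed ID prefix
                "target": f"file:{target_path}",
                "type": "imports",
            })
    return result
```

Because the model's own imports edges never survive this step, a hallucinated import cannot reach the merged graph, whatever the model quality.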

Graceful fallback: If Gemma fails (timeout, parse error), the runner writes an empty batch-N.json. The pipeline continues; assemble-reviewer notes the gap and merge-batch-graphs.py handles it.
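A rough sketch of that fallback is shown below. It assumes the analysis step surfaces every failure as `RuntimeError`, and that an empty batch is a JSON object with empty `nodes` and `edges` lists; the PR does not show the exact shape. `analyze_with_fallback` and `EMPTY_BATCH` are hypothetical names.

```python
import json
import sys
from pathlib import Path

# Assumed shape of an empty batch; the real schema may carry more fields.
EMPTY_BATCH = {"nodes": [], "edges": []}


def analyze_with_fallback(batch_index, analyze, out_dir):
    """Run analyze() and always write batch-<N>.json, empty on failure."""
    out_path = Path(out_dir) / f"batch-{batch_index}.json"
    try:
        graph = analyze()
    except RuntimeError as exc:
        # Log the failure but keep the pipeline moving.
        print(
            f"warning: batch {batch_index} failed ({exc}); writing empty batch",
            file=sys.stderr,
        )
        graph = EMPTY_BATCH
    out_path.write_text(json.dumps(graph, indent=2), encoding="utf-8")
    return out_path
```

Writing the file in both branches means downstream steps can rely on every batch-N.json existing, and only need to handle the empty case.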

Safe for existing users: HYBRID_MODE=false by default. No behavior change unless --hybrid is explicitly passed.

Requirements

  • Ollama running locally with a capable model (tested with gemma4:26b-a4b; other models of ≥13B parameters are expected to work but are untested)
  • Python 3 (stdlib only — no new dependencies)
  • The extract-structure.mjs script already bundled with the skill runs unchanged

Tested on

  • gemma4:26b-a4b via Ollama
  • Calibration: 100% node ID prefix correctness, 100% edge type vocabulary, JSON validity ✅
  • Output flows correctly through merge-batch-graphs.py normalization

Test plan

  • python3 hybrid_runner.py --help shows scan/analyze subcommands
  • Run /understand --hybrid on a small repo (20-50 files) — verify batch-*.json written
  • Compare assembled graph node/edge counts between standard and hybrid modes
  • Verify OLLAMA_MODEL env var respected
  • Verify Ollama-down fallback logs warning and continues

🤖 Generated with Claude Code

… via Ollama

Adds hybrid_runner.py and a new --hybrid flag to the /understand skill,
routing the expensive file-analyzer extraction phase to a local Gemma model
(default: gemma4:26b-a4b) via Ollama instead of Claude subagents. Reduces
Claude API cost by ~4-5× on medium/large repos by keeping architectural
reasoning on Claude and offloading structural extraction to a local model.

Changes:
- skills/understand/hybrid_runner.py (new): standalone Python script that
  runs extract-structure.mjs (unchanged), sends batch data to Ollama, and
  applies post-processing (import edge injection from batchImportData,
  weight normalization, deduplication). Graceful fallback on Gemma failure.
- skills/understand/SKILL.md: adds --hybrid flag docs, Phase 0 Ollama
  reachability check (HYBRID_MODE), Phase 2 hybrid dispatch path (5
  parallel background python processes writing batch-N.json)

Design notes:
- project-scanner and assemble-reviewer stay on Claude (low cost / need judgment)
- Only imports edges are stripped and reinjected from batchImportData to prevent
  Gemma hallucinating import relationships (confirmed via calibration probe)
- Configure via OLLAMA_HOST, OLLAMA_MODEL, HYBRID_TIMEOUT env vars
- Tested on gemma4:26b-a4b; compatible with any Ollama-hosted model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Lum1104

Lum1104 commented May 23, 2026

Collaborator

Sounds good. Can this run faster, or does it just reduce token usage in your use case?

@codex review this

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 97a75401f3

Comment on lines +101 to +105
except urllib.error.HTTPError as e:
    body = e.read().decode()[:500]
    raise RuntimeError(f"Ollama HTTP {e.code}: {body}") from e
except TimeoutError:
    raise RuntimeError(f"Ollama request timed out after {timeout}s") from None

P1 Badge Catch URLError and preserve fallback behavior

call_ollama only converts HTTPError/TimeoutError to RuntimeError, but urllib.request.urlopen raises URLError for common failures like connection refused, DNS failures, and many timeout cases. In those cases run_analyze's except RuntimeError block is skipped, so the process exits with an uncaught exception instead of writing the documented empty batch-<N>.json fallback, which can break hybrid runs when Ollama becomes unavailable mid-analysis.
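One way to address this is sketched below, under assumptions: the request goes to Ollama's `/api/generate` endpoint with `stream` disabled, and the function signature is invented for illustration (the `opener` parameter is a hypothetical hook so the error paths can be exercised without a running server). It is not the PR's actual `call_ollama`.

```python
import json
import socket
import urllib.error
import urllib.request


def call_ollama(prompt, host="http://localhost:11434", model="gemma4:26b-a4b",
                timeout=600.0, opener=urllib.request.urlopen):
    """POST a prompt to Ollama and return the generated text.

    Every transport failure is re-raised as RuntimeError so that a caller's
    `except RuntimeError` fallback is reached.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))["response"]
    except urllib.error.HTTPError as e:
        # HTTPError subclasses URLError, so it has to be handled first.
        body = e.read().decode(errors="replace")[:500]
        raise RuntimeError(f"Ollama HTTP {e.code}: {body}") from e
    except urllib.error.URLError as e:
        # Connection refused, DNS failure, and timeouts wrapped by urllib.
        raise RuntimeError(f"Ollama unreachable: {e.reason}") from e
    except (TimeoutError, socket.timeout):
        # Timeouts raised while reading the response body.
        raise RuntimeError(f"Ollama request timed out after {timeout}s") from None
```

The order of the `except` clauses matters: placing `URLError` before `HTTPError` would swallow the HTTP status and body.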


Comment on lines +166 to +168
n_type = node.get("type", "file")
if n_type not in VALID_NODE_TYPES:
    node["type"] = "file" # fallback

P2 Badge Populate missing node type before emitting batch output

The normalization path only fixes type when it is invalid, not when it is missing: node.get("type", "file") makes the check pass while leaving node["type"] unset. If Gemma omits type for a node, that node is emitted without a required schema field and will be dropped later during graph validation, causing avoidable node loss in hybrid mode.
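A possible shape for the fix is to resolve the type first and then write it back unconditionally. In this sketch the contents of `VALID_NODE_TYPES` are placeholders (the real vocabulary is defined by the skill's schema) and `normalize_node_type` is a hypothetical helper name.

```python
# Placeholder vocabulary; the real set comes from the graph schema.
VALID_NODE_TYPES = {"file", "function", "class"}


def normalize_node_type(node):
    """Ensure node["type"] is always present and valid; mutates and returns node."""
    n_type = node.get("type", "file")
    # Guard against non-string values (e.g. a list), which cannot be set members.
    if not isinstance(n_type, str) or n_type not in VALID_NODE_TYPES:
        n_type = "file"  # fallback for invalid values
    # Unconditional write-back covers the missing-key case as well.
    node["type"] = n_type
    return node
```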


@simkimsia


If I increase the concurrent batches from 5 to 10, will that speed things up?

Why only file-analyzer? It's ~80% of total token cost — runs in 5 concurrent batches of 20-30 files each. The other agents are either low-cost or require Claude-level reasoning quality.

…ng node type)

P1: call_ollama only caught HTTPError/TimeoutError, so urllib.error.URLError
(connection refused when Ollama is down, DNS failures, socket-level timeouts)
escaped run_analyze's `except RuntimeError` and crashed the hybrid run instead
of writing the documented empty batch-N.json fallback. Catch URLError and wrap
it as RuntimeError so graceful fallback works as advertised.

P2: node `type` normalization used `node.get("type", "file")` to test validity
but only wrote back on an invalid value — a node with a *missing* type key
passed the check while node["type"] stayed unset, causing it to be dropped at
schema validation (silent node loss). Write the resolved type back
unconditionally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Itsthewayofyou

Author

@Lum1104 Good question — it's primarily a cost play, not a speed one. The ~4-5× is token/API cost, not wall-clock. On most setups it's actually a touch slower: the 5 file-analyzer batches all hit a single local Ollama instance, so a large model on one GPU serializes generation, whereas standard mode runs those batches with Claude's cloud parallelism. So: meaningfully cheaper, comparable-to-slightly-slower on speed depending on your local hardware. If latency (not cost) is the goal, standard mode is still the faster path.

Also worth noting the fix I just pushed (a84b942): a connection-refused/DNS failure now correctly falls back to an empty batch instead of crashing the run, and a node emitted without a type no longer gets silently dropped at validation.

@Itsthewayofyou

Author

@simkimsia In hybrid mode, no — and it can make things slightly worse. All batches queue against a single Ollama endpoint, so throughput is bound by the GPU and OLLAMA_NUM_PARALLEL, not by how many batches you fan out. Sending 10 instead of 5 just deepens the queue and adds memory/context pressure on a large model. The levers that actually move hybrid speed are the local model size/quant, OLLAMA_NUM_PARALLEL, and your GPU.

In standard (Claude) mode, raising concurrency does help — up to your API rate limits — since each batch is an independent cloud request.
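The argument can be made concrete with a deliberately crude model: batches complete in waves, and the wave width is the smaller of the client fan-out and the number of requests the server actually runs at once. Everything in this sketch is an assumption made for illustration, including the function name, the fixed per-batch duration, and the numbers in the usage note; it ignores the memory pressure mentioned above and the fact that parallel slots on one GPU share its throughput.

```python
import math


def estimated_wall_clock(total_batches, client_concurrency, server_slots,
                         seconds_per_batch):
    """Toy estimate: batches finish in waves of min(client, server) width."""
    if min(total_batches, client_concurrency, server_slots) < 1:
        raise ValueError("counts must be positive")
    # The server cannot run more requests at once than it has slots,
    # no matter how many the client sends.
    effective = min(client_concurrency, server_slots)
    waves = math.ceil(total_batches / effective)
    return waves * seconds_per_batch
```

With 10 batches of 60 seconds each and a single server slot, fan-outs of 5 and 10 both give 600 seconds in this model. With effectively unbounded server-side parallelism, as in standard mode, the same inputs give 120 and 60 seconds.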

Resolve SKILL.md conflicts in Phase 1 / Phase 2 sections:
- Keep both upstream's `[Phase 1/7] Scanning...` progress line and the
  PR's hybrid-mode Phase 1 note.
- Keep the PR's HYBRID_MODE batch-dispatch block; adopt upstream's
  reworded file-analyzer dispatch line and prompt-template intro
  (now references batches.json[i]).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

5 participants