What
Foreman v0.2 closes the reviewer-step empirical gap that v0.1 surfaced under load. Tracks the full work for the v0.8.0 release, which lands Foreman's reviewer architecture in a state that actually delivers the "drop a backlog, walk away, come back to merge-ready PRs" claim.
This epic is a planning + tracking issue. Each sub-task lands as its own PR.
Why
The Memorial Day v5 batch (run 2026-05-25) plus a controlled reviewer-comparison experiment (same day) produced a clear empirical finding:
- The v0.1 pipeline's same-model reviewer (Qwen 3.6 35B-A3B reviewing Qwen 3.6 35B-A3B output) approves structurally-broken PRs.
- Different-family local reviewers at the 24-26B scale (Devstral, Gemma 4) also approve the same PRs.
- Tools, more context (TurboQuant 65K), structured deliberation prompts, and 3-of-3 unanimous ensemble all produce a 0/3 catch rate on three hand-curated structural bugs.
- A frontier cloud model (Claude, tool-using) catches 3/3.
- Falsification framing + PR-metadata redaction (on Devstral) and thinking-mode (on Qwen3) DO shift verdicts to REQUEST-CHANGES on different PRs each, for false-positive-ish reasons. Combined any-NO-GO escalation flags 3/3 PRs for human review without catching the actual bugs.
Full experiment writeup: llmkube-internal/diagnostics/reviewer-comparison-2026-05-25/SUMMARY.md (private repo; will be summarized publicly in a research-note blog post separate from this release).
The v0.2 architecture this implies:
- Multi-strategy local reviewer ensemble (3 prompt strategies, any-NO-GO escalates) closes the gap for air-gap-only orgs in the "flag for human review" sense.
- Optional hybrid cloud reviewer (sovereignty-toggled) closes the gap structurally for orgs that can use it.
- Pipeline reliability fixes unblock the production reliability gaps that surfaced under load.
How
Seven sub-tasks, each its own PR. Day-by-day execution plan with rough scoping:
Reliability fixes (the bugs filed during the v5 dogfood)
Reviewer architecture
Three-step pipeline on heterogeneous hardware (M5 proper)
Validation + ship
Explicitly deferred to v0.3+
- Static analysis pre-flight (semgrep / ast-grep rules for known patterns). High value for known bug classes, but hybrid cloud catches the same class via LLM reasoning, which generalizes. v0.3 engineering work.
- #544 stuck-loop detector. Useful for v0.3 production hardening.
- Per-Agent Prometheus metrics + Grafana dashboard. Production observability; not announce-worthy for v0.2.
- Transcript retention TTL / archive.
- LLM-driven planner (M6 proper, autonomous intent decomposition). v0.2's stub planner stays.
- Multi-cluster story (multi-site L4 edge fleet pattern). v0.4+ aligned with the design partner's timeline.
Checklist
What
Foreman v0.2 closes the reviewer-step empirical gap that v0.1 surfaced under load. Tracks the full work for the v0.8.0 release, which lands Foreman's reviewer architecture in a state that actually delivers the "drop a backlog, walk away, come back to merge-ready PRs" claim.
This epic is a planning + tracking issue. Each sub-task lands as its own PR.
Why
The Memorial Day v5 batch (run 2026-05-25) plus a controlled reviewer-comparison experiment (same day) produced a clear empirical finding:
Full experiment writeup:
llmkube-internal/diagnostics/reviewer-comparison-2026-05-25/SUMMARY.md(private repo; will be summarized publicly in a research-note blog post separate from this release).The v0.2 architecture this implies:
How
Seven sub-tasks, each its own PR. Day-by-day execution plan with rough scoping:
Reliability fixes (the bugs filed during the v5 dogfood)
cmd.Waitwhen LLM-issued bash spawns a grandchild that holds inherited pipes. Fix:cmd.WaitDelay = 5s,cmd.Cancel = kill,Setpgid + kill process group on Unix. Regression test with a deliberately-orphaning bash command.:9090health for the current dynamic port). Remove the hardcoded--inference-base-url-overridelaunchd flag.phase=Succeededonly, ignoring verdict. AddtaskSucceededOnTarget(t bool)helper; reuse inAgenticTaskReconciler.cascadeFailIfDepFailedandWorkloadReconciler.succeededTasksrollup.phase=Runningtasks on restart. Startup hook in the watcher: listassignedNode == myFleetNodeName && phase == Running; reset toPendingwith anAgentRestartRecoverycondition.modelDecidedResultinstead ofnoChangesResult.Reviewer architecture
[]corev1.LocalObjectReferencereplacing the singularreviewerAgentRef. Reconciler emitsreview-N-strategy-A,review-N-strategy-B,review-N-strategy-Cfor each issue, alldependsOntheverify-Ntask. Aggregate logic: any single REQUEST-CHANGES across the strategies → human review queue label; unanimous APPROVE → mergeable label.validator-reviewer(currentreviewer.md)falsification-reviewer(the newreviewer-falsification.md, PR-metadata redacted by the executor when this Agent dispatches)thinking-reviewer(Qwen3 with thinking-mode enabled on the Agent'sInferenceServiceRefserver-side flags)Agent.spec.externalEndpointfield withsecretReffor an API key. NewAgent.spec.sovereigntyenum:"local-only"(default) or"cloud-ok". Executor refuses to dispatch to an external endpoint if any owning Workload or referenced AgenticTask requireslocal-only. WorkloadReconciler emits the cloud reviewer task ONLY when at least one local strategy returned REQUEST-CHANGES (escalation), so cloud cost is bounded.Three-step pipeline on heterogeneous hardware (M5 proper)
foreman-reviewernamespace on shadowstack. Mac Studio metal-agent installed with--namespace=foreman-reviewerso it doesn't conflict with the M5 Max metal-agent watchingdefault.--roles=reviewer, kubeconfig, git auth, workspace dir. FleetNode registration verified. Reviewer Agent'srequiredCapability.roles: [reviewer]matches.docs/foreman/runbook-m4.mdupdates.Validation + ship
docs/foreman/v0.2-reviewer-architecture.md(three-strategy ensemble + cloud escalation + sovereignty toggle),docs/foreman/hybrid-cloud-setup.md(runbook), updateddocs/foreman/runbook-m4.mdto the v0.2 architecture.Explicitly deferred to v0.3+
Checklist
llmkube-internal/diagnostics/v0.2-validation-...llmkube.com/blog) explains the empirical foundation for the architecture