Skip to content

[FEATURE] Foreman v0.2 / v0.8.0: close the reviewer-step empirical gap #546

@Defilan

Description

@Defilan

What

Foreman v0.2 closes the reviewer-step empirical gap that v0.1 surfaced under load. Tracks the full work for the v0.8.0 release, which lands Foreman's reviewer architecture in a state that actually delivers the "drop a backlog, walk away, come back to merge-ready PRs" claim.

This epic is a planning + tracking issue. Each sub-task lands as its own PR.

Why

The Memorial Day v5 batch (run 2026-05-25) plus a controlled reviewer-comparison experiment (same day) produced a clear empirical finding:

  • The v0.1 pipeline's same-model reviewer (Qwen 3.6 35B-A3B reviewing Qwen 3.6 35B-A3B output) approves structurally-broken PRs.
  • Different-family local reviewers at the 24-26B scale (Devstral, Gemma 4) also approve the same PRs.
  • Tools, more context (TurboQuant 65K), structured deliberation prompts, and 3-of-3 unanimous ensemble all produce a 0/3 catch rate on three hand-curated structural bugs.
  • A frontier cloud model (Claude, tool-using) catches 3/3.
  • Falsification framing + PR-metadata redaction (on Devstral) and thinking-mode (on Qwen3) DO shift verdicts to REQUEST-CHANGES on different PRs each, for false-positive-ish reasons. Combined any-NO-GO escalation flags 3/3 PRs for human review without catching the actual bugs.

Full experiment writeup: llmkube-internal/diagnostics/reviewer-comparison-2026-05-25/SUMMARY.md (private repo; will be summarized publicly in a research-note blog post separate from this release).

The v0.2 architecture this implies:

  1. Multi-strategy local reviewer ensemble (3 prompt strategies, any-NO-GO escalates) closes the gap for air-gap-only orgs in the "flag for human review" sense.
  2. Optional hybrid cloud reviewer (sovereignty-toggled) closes the gap structurally for orgs that can use it.
  3. Pipeline reliability fixes unblock the production reliability gaps that surfaced under load.

How

Seven sub-tasks, each its own PR. Day-by-day execution plan with rough scoping:

Reliability fixes (the bugs filed during the v5 dogfood)

  • #539 BashTool deadlock in cmd.Wait when LLM-issued bash spawns a grandchild that holds inherited pipes. Fix: cmd.WaitDelay = 5s, cmd.Cancel = kill, Setpgid + kill process group on Unix. Regression test with a deliberately-orphaning bash command.
  • #540 foreman-agent stale port override. Teach foreman-agent to resolve InferenceService endpoints dynamically (query metal-agent's :9090 health for the current dynamic port). Remove the hardcoded --inference-base-url-override launchd flag.
  • #541 cascade + Workload rollup gate on phase=Succeeded only, ignoring verdict. Add taskSucceededOnTarget(t bool) helper; reuse in AgenticTaskReconciler.cascadeFailIfDepFailed and WorkloadReconciler.succeededTasks rollup.
  • #542 foreman-agent does not recover orphaned phase=Running tasks on restart. Startup hook in the watcher: list assignedNode == myFleetNodeName && phase == Running; reset to Pending with an AgentRestartRecovery condition.
  • #543 executor downgrades reviewer Agent's APPROVE verdict to NO-GO no-diff. DONE in PR fix(foreman/executor): route reviewer-role GO through modelDecidedResult #545. Route reviewer-role GO through modelDecidedResult instead of noChangesResult.

Reviewer architecture

  • WorkloadSpec.reviewerAgentRefs (plural). Schema extension: []corev1.LocalObjectReference replacing the singular reviewerAgentRef. Reconciler emits review-N-strategy-A, review-N-strategy-B, review-N-strategy-C for each issue, all dependsOn the verify-N task. Aggregate logic: any single REQUEST-CHANGES across the strategies → human review queue label; unanimous APPROVE → mergeable label.
  • Three reviewer Agent CRs for the v0.2 ship set, using the system prompts already drafted during the empirical experiment:
    • validator-reviewer (current reviewer.md)
    • falsification-reviewer (the new reviewer-falsification.md, PR-metadata redacted by the executor when this Agent dispatches)
    • thinking-reviewer (Qwen3 with thinking-mode enabled on the Agent's InferenceServiceRef server-side flags)
  • Hybrid cloud reviewer Agent (optional, sovereignty-gated). New Agent.spec.externalEndpoint field with secretRef for an API key. New Agent.spec.sovereignty enum: "local-only" (default) or "cloud-ok". Executor refuses to dispatch to an external endpoint if any owning Workload or referenced AgenticTask requires local-only. WorkloadReconciler emits the cloud reviewer task ONLY when at least one local strategy returned REQUEST-CHANGES (escalation), so cloud cost is bounded.

Three-step pipeline on heterogeneous hardware (M5 proper)

  • Devstral via LLMKube on Mac Studio. Model + InferenceService CRs in a new foreman-reviewer namespace on shadowstack. Mac Studio metal-agent installed with --namespace=foreman-reviewer so it doesn't conflict with the M5 Max metal-agent watching default.
  • foreman-agent on Mac Studio (cortex). launchd unit with --roles=reviewer, kubeconfig, git auth, workspace dir. FleetNode registration verified. Reviewer Agent's requiredCapability.roles: [reviewer] matches.
  • V4 end-to-end demo. Pipeline of code@M5Max → verify@shadowstack → review@cortex executes on a real issue end to end. Documented in docs/foreman/runbook-m4.md updates.

Validation + ship

Explicitly deferred to v0.3+

  • Static analysis pre-flight (semgrep / ast-grep rules for known patterns). High value for known bug classes, but hybrid cloud catches the same class via LLM reasoning, which generalizes. v0.3 engineering work.
  • #544 stuck-loop detector. Useful for v0.3 production hardening.
  • Per-Agent Prometheus metrics + Grafana dashboard. Production observability; not announce-worthy for v0.2.
  • Transcript retention TTL / archive.
  • LLM-driven planner (M6 proper, autonomous intent decomposition). v0.2's stub planner stays.
  • Multi-cluster story (multi-site L4 edge fleet pattern). v0.4+ aligned with the design partner's timeline.

Checklist

  • All sub-task PRs merged
  • Empirical validation results saved under llmkube-internal/diagnostics/v0.2-validation-...
  • Public-facing docs updated (reviewer architecture, hybrid cloud setup, sovereignty toggle)
  • release-please cuts v0.8.0
  • Companion research-note blog post (llmkube.com/blog) explains the empirical foundation for the architecture

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions