[FEATURE] Foreman v0.2 / v0.8.0: close the reviewer-step empirical gap

## What

Foreman v0.2 closes the reviewer-step empirical gap that v0.1 surfaced under load. Tracks the full work for the **v0.8.0 release**, which lands Foreman's reviewer architecture in a state that actually delivers the "drop a backlog, walk away, come back to merge-ready PRs" claim.

This epic is a planning + tracking issue. Each sub-task lands as its own PR.

## Why

The Memorial Day v5 batch (run 2026-05-25) plus a controlled reviewer-comparison experiment (same day) produced a clear empirical finding:

- The v0.1 pipeline's same-model reviewer (Qwen 3.6 35B-A3B reviewing Qwen 3.6 35B-A3B output) approves structurally-broken PRs.
- Different-family local reviewers at the 24-26B scale (Devstral, Gemma 4) also approve the same PRs.
- Tools, more context (TurboQuant 65K), structured deliberation prompts, and 3-of-3 unanimous ensemble all produce a 0/3 catch rate on three hand-curated structural bugs.
- A frontier cloud model (Claude, tool-using) catches 3/3.
- Falsification framing + PR-metadata redaction (on Devstral) and thinking-mode (on Qwen3) DO shift verdicts to REQUEST-CHANGES on different PRs each, for false-positive-ish reasons. Combined any-NO-GO escalation flags 3/3 PRs for human review without catching the actual bugs.

Full experiment writeup: `llmkube-internal/diagnostics/reviewer-comparison-2026-05-25/SUMMARY.md` (private repo; will be summarized publicly in a research-note blog post separate from this release).

The v0.2 architecture this implies:

1. **Multi-strategy local reviewer ensemble** (3 prompt strategies, any-NO-GO escalates) closes the gap for air-gap-only orgs in the "flag for human review" sense.
2. **Optional hybrid cloud reviewer** (sovereignty-toggled) closes the gap structurally for orgs that can use it.
3. **Pipeline reliability fixes** unblock the production reliability gaps that surfaced under load.

## How

Seven sub-tasks, each its own PR. Day-by-day execution plan with rough scoping:

### Reliability fixes (the bugs filed during the v5 dogfood)

- [ ] [#539](https://github.com/defilantech/LLMKube/issues/539) BashTool deadlock in `cmd.Wait` when LLM-issued bash spawns a grandchild that holds inherited pipes. Fix: `cmd.WaitDelay = 5s`, `cmd.Cancel = kill`, `Setpgid + kill process group on Unix`. Regression test with a deliberately-orphaning bash command.
- [ ] [#540](https://github.com/defilantech/LLMKube/issues/540) foreman-agent stale port override. Teach foreman-agent to resolve InferenceService endpoints dynamically (query metal-agent's `:9090` health for the current dynamic port). Remove the hardcoded `--inference-base-url-override` launchd flag.
- [ ] [#541](https://github.com/defilantech/LLMKube/issues/541) cascade + Workload rollup gate on `phase=Succeeded` only, ignoring verdict. Add `taskSucceededOnTarget(t bool)` helper; reuse in `AgenticTaskReconciler.cascadeFailIfDepFailed` and `WorkloadReconciler.succeededTasks` rollup.
- [ ] [#542](https://github.com/defilantech/LLMKube/issues/542) foreman-agent does not recover orphaned `phase=Running` tasks on restart. Startup hook in the watcher: list `assignedNode == myFleetNodeName && phase == Running`; reset to `Pending` with an `AgentRestartRecovery` condition.
- [x] [#543](https://github.com/defilantech/LLMKube/issues/543) executor downgrades reviewer Agent's APPROVE verdict to NO-GO no-diff. **DONE in PR #545.** Route reviewer-role GO through `modelDecidedResult` instead of `noChangesResult`.

### Reviewer architecture

- [ ] **WorkloadSpec.reviewerAgentRefs (plural)**. Schema extension: `[]corev1.LocalObjectReference` replacing the singular `reviewerAgentRef`. Reconciler emits `review-N-strategy-A`, `review-N-strategy-B`, `review-N-strategy-C` for each issue, all `dependsOn` the `verify-N` task. Aggregate logic: any single REQUEST-CHANGES across the strategies → human review queue label; unanimous APPROVE → mergeable label.
- [ ] **Three reviewer Agent CRs** for the v0.2 ship set, using the system prompts already drafted during the empirical experiment:
  - `validator-reviewer` (current `reviewer.md`)
  - `falsification-reviewer` (the new `reviewer-falsification.md`, PR-metadata redacted by the executor when this Agent dispatches)
  - `thinking-reviewer` (Qwen3 with thinking-mode enabled on the Agent's `InferenceServiceRef` server-side flags)
- [ ] **Hybrid cloud reviewer Agent (optional, sovereignty-gated)**. New `Agent.spec.externalEndpoint` field with `secretRef` for an API key. New `Agent.spec.sovereignty` enum: `"local-only"` (default) or `"cloud-ok"`. Executor refuses to dispatch to an external endpoint if any owning Workload or referenced AgenticTask requires `local-only`. WorkloadReconciler emits the cloud reviewer task ONLY when at least one local strategy returned REQUEST-CHANGES (escalation), so cloud cost is bounded.

### Three-step pipeline on heterogeneous hardware (M5 proper)

- [ ] **Devstral via LLMKube on Mac Studio**. Model + InferenceService CRs in a new `foreman-reviewer` namespace on shadowstack. Mac Studio metal-agent installed with `--namespace=foreman-reviewer` so it doesn't conflict with the M5 Max metal-agent watching `default`.
- [ ] **foreman-agent on Mac Studio (cortex)**. launchd unit with `--roles=reviewer`, kubeconfig, git auth, workspace dir. FleetNode registration verified. Reviewer Agent's `requiredCapability.roles: [reviewer]` matches.
- [ ] **V4 end-to-end demo**. Pipeline of code@M5Max → verify@shadowstack → review@cortex executes on a real issue end to end. Documented in `docs/foreman/runbook-m4.md` updates.

### Validation + ship

- [ ] **Expanded empirical validation**. Re-run the v0.2 reviewer architecture against the original v5 PRs (#449, #506, #510) + 5-10 newly curated PRs with known structural bugs from upstream and/or other public repos. Target: hybrid cloud catches >70% of structural bugs (matching Claude's 3/3 on the hand-curated set); local-only ensemble escalates 100% of PRs containing real bugs to human review; false-positive rate on a clean control set <30%.
- [ ] **Documentation**: `docs/foreman/v0.2-reviewer-architecture.md` (three-strategy ensemble + cloud escalation + sovereignty toggle), `docs/foreman/hybrid-cloud-setup.md` (runbook), updated `docs/foreman/runbook-m4.md` to the v0.2 architecture.
- [ ] **release-please cut for v0.8.0** with prominent release notes featuring the empirical evidence + the architecture decision.

## Explicitly deferred to v0.3+

- Static analysis pre-flight (semgrep / ast-grep rules for known patterns). High value for *known* bug classes, but hybrid cloud catches the same class via LLM reasoning, which generalizes. v0.3 engineering work.
- [#544](https://github.com/defilantech/LLMKube/issues/544) stuck-loop detector. Useful for v0.3 production hardening.
- Per-Agent Prometheus metrics + Grafana dashboard. Production observability; not announce-worthy for v0.2.
- Transcript retention TTL / archive.
- LLM-driven planner (M6 proper, autonomous intent decomposition). v0.2's stub planner stays.
- Multi-cluster story (multi-site L4 edge fleet pattern). v0.4+ aligned with the design partner's timeline.

## Checklist

- [ ] All sub-task PRs merged
- [ ] Empirical validation results saved under `llmkube-internal/diagnostics/v0.2-validation-...`
- [ ] Public-facing docs updated (reviewer architecture, hybrid cloud setup, sovereignty toggle)
- [ ] release-please cuts v0.8.0
- [ ] Companion research-note blog post (`llmkube.com/blog`) explains the empirical foundation for the architecture



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Foreman v0.2 / v0.8.0: close the reviewer-step empirical gap #546

What

Why

How

Reliability fixes (the bugs filed during the v5 dogfood)

Reviewer architecture

Three-step pipeline on heterogeneous hardware (M5 proper)

Validation + ship

Explicitly deferred to v0.3+

Checklist

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[FEATURE] Foreman v0.2 / v0.8.0: close the reviewer-step empirical gap #546

Description

What

Why

How

Reliability fixes (the bugs filed during the v5 dogfood)

Reviewer architecture

Three-step pipeline on heterogeneous hardware (M5 proper)

Validation + ship

Explicitly deferred to v0.3+

Checklist

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions