Skip to content

Latest commit

 

History

History
1830 lines (1404 loc) · 180 KB

File metadata and controls

1830 lines (1404 loc) · 180 KB

luxe — session resume document

⇒ SESSION HANDOFF (2026-06-01) — CLI robustness & task-type auto-detection improvements SHIPPED (gated preflight checks, expanded programming verbs); commit local

TL;DR for a cold start. Fixed CLI instability on read-only tasks and improved task-type auto-detection:

  1. Gated Preflight: Gated git and auth preflight checks (assert_gh_auth(), assert_clean_tree(), and plan_branch_name()) in src/luxe/pr.py to write tasks only. Read-only tasks successfully pass preflight without raising auth or dirty-tree errors, enabling offline/local reviews on dirty repositories.
  2. Robust Keyword Inference: Expanded task-type auto-detection in src/luxe/cli.py (_infer_task_type) with common programming verbs (like refactor, rewrite, optimize, patch, resolve, configure, and comment) to map them correctly to write tasks instead of falling back to read-only review tasks.
  3. Graceful Console Output: Updated cli.py to display (none) if the planned branch name is empty (default for read-only tasks).
  4. Validation: Added 3 unit tests in tests/test_pr_flow.py for task type auto-detection and gated preflight behaviour. Full test suite is completely passing (1259 passed, 6 skipped, 2 warnings in 30.49s).

⇒ PREVIOUS SESSION HANDOFF (2026-06-01) — Interactive luxe chat overhaul SHIPPED (REPL + compare + memory + opt-in model slots); commit 280675a, pushed; deployed to m5 + neo

TL;DR for a cold start. First user-facing interface work in a long time. Additive Claude-CLI-style overhaul; the existing one-shot luxe maintain, the benchmark harness, and the deterministic run_agent loop are byte-identical to before. Four new capabilities, all opt-in / default-preserving:

  1. luxe chat — interactive REPL. Each turn = exactly one run_single call; conversation state lives in the REPL (transcript-fold into goal= + a tagged extra_context block), never forks run_agent. Live tool output via the existing on_tool_event seam, markdown answers, footer (slot·model·write-mode·steps·tokens·swaps). Slash cmds /model /use /write /memory /compare /resume /clear. Ctrl-C aborts at next tool boundary + saves partial transcript. Read-only tools by default; /write toggles.
  2. Model slots (config.py SlotConfig/ChatSlots/model_for_slot): opt-in chat/plan/code model selection. Default (no slots: / empty model_key) → champion everywhere, byte-identical to single_64gb. The sanctioned exception to the single-champion / no-fan-out invariant (carve-out line in luxe.sdd). Distinct slot models → sequential weight swap (unload_all_loaded+thermal_guard), instrumented.
  3. Compare (luxe compare run/review, /compare): 3 modes — (1) luxe-vs-bare substrate ablation (os.environ save/restore disables compaction + interventions + baseline prompt), (2) two prompt variants, (3) vs another model. Sequential, blind + vote + free-text rationale → ~/.luxe/compare/<id>/ votes.jsonl, replayable. Reuses benchmark Variant + make_overlay.
  4. Memory (src/luxe/memory/): ~/.luxe/sessions/<id>/ transcripts (resume; gc keep-50/30d) + curated-first project memory (repo .luxe/memory.md always injected; auto facts unpromoted until /memory promote). Never reads ~/.claude/ or repo CLAUDE.md.

Load-bearing invariants honored. Memory/history inject ONLY via the new run_single(extra_context="") seam (default "" = byte-identical; benchmark/ maintain pass nothing). backend.py stream/on_token gated — default request body byte-identical, on_token inert when stream=False (asserted); the loop still calls non-stream (streaming is infrastructure only). Summarizer (chat/summarize.py) is non-model, deterministic, versioned (trunc-v1). Context precedence: current turn > project memory > conversation summary.

New packages: src/luxe/{chat,compare,memory}/, each with its own .sdd. Dep: prompt_toolkit as optional [chat] extra (pip install -e .[chat]; degrades to input()). Tests: 76 new incl. determinism byte-identity gates; full suite 1199 passed (the only skip/error is test_mlx_direct_smoke.py, which needs the optional mlx native module — pre-existing, env-gated).

Deployed 2026-06-01 to m5, m1, and neoluxe symlinked into /opt/homebrew/bin on each (just type luxe); OMLX_API_KEY added to ~/.zshrc on m1+neo. m5 + m1 run the 35B champion via oMLX:8000 (model cached, key authenticates). neo runs the micro-mind champion (Qwen2.5-1.5B-Instruct-Q8_0 GGUF — the 35B won't fit neo RAM) via llama-server on :8080, NOT oMLX (oMLX can't serve GGUF; luxe's Backend is OpenAI-compatible so it points straight at :8080). neo's configs/chat.yaml is pinned via git update-index --skip-worktree (omlx_base_url→:8080, model→the GGUF id, num_ctx 8192, repeat_penalty 1.1, max_tokens 2048). llama-server runs as launchd agent com.micromind.llama-server.plist (RunAtLoad + KeepAlive; flags from micro-mind's validated config + --repeat-penalty 1.1 server-side since luxe sends it under extra_body, which llama.cpp ignores). Validated the model-slots design in the wild: same luxe substrate, a 1.5B brain on the low-RAM box. See ~/Downloads/ micro-mind for the champion's provenance (neo-llm-bench, 2026-05-14).

Deferred (flagged in-plan): KV-preserving multi-turn (needs run_agent(seed_messages=)), token-level streaming into the loop, model-based summarizer. Plan: ~/.claude/plans/crispy-juggling-starlight.md. Memory: project_luxe_chat_interactive_overhaul.md.


⇒ SESSION HANDOFF (2026-05-28) — Forge-hybrid cycle CLOSED; TieredCompact ships DEFAULT-ON at phase_thresholds=(0.50, 0.85, 0.95); B+D refuted, banked in-tree default-OFF

TL;DR for a cold start. The forge-hybrid cycle (~/.claude/plans/starry-hopping-phoenix.md, executed 2026-05-26 → 2026-05-28) ran 4 axis ports from forge and closed with A shipping default-ON and B+D refuted at smoke. Compaction is now default-ON for ALL run_agent callers (SWE-bench, maintain_suite, BFCL) — LUXE_TIERED_COMPACT defaults to enabled; TieredCompact._DEFAULT_PHASE_THRESHOLDS = (0.50, 0.85, 0.95). Set LUXE_TIERED_COMPACT=0 for ablation. This is the cycle's only Pareto-positive lever: resolves equivalent to no-compaction baseline (within substrate noise ±2.8 at n=75 across 2 reps), wall reduced 42-56%, 2 protected wrong_target instances healed (matplotlib-25775, pylint-6528). Phase 3 (B) respond-terminal tool: 0/14 organic adoption + 0/14 with explicit prompt guidance → champion ignores the lever; infra in-tree default-OFF behind LUXE_RESPOND_TERMINAL. Phase 4 (D) trajectory-shape suppression: locked predicate (sustained_low_trend≥3 AND grep_vs_read_ratio<0.5 AND breadth_saturation<0.6) fired 0/14 at smoke → too narrow for this champion at num_ctx=32768; infra in-tree default-OFF behind LUXE_EARLY_BAIL_TRAJECTORY_SHAPE. Phase 5 trivial = A solo (already validated). Shipped + pushed across 9 commits (4581d389be486c).

Cycle's load-bearing design finding: phase-1-helps / phase-3-hurts. A single-knob compact_threshold can't capture the trade-off; the forge-style phase_thresholds tuple decouples aggressiveness per phase. Aggressive phase 1 (50% pressure, drops nudges + truncates tool_results) HEALS protected wrong_target instances; conservative phase 3 (95% pressure, drops reasoning content) avoids the destructive mode (observed at 1/75 firing rate at this tuning). Portable insight for future compaction-like levers.

Substrate non-determinism is the cycle's interpretive framework (separately banked in lessons.md 2026-05-26 + memory project_substrate_noise_temp0_not_deterministic.md). Qwen3.6-35B-A3B-6bit at temp=0 on oMLX/MLX is NOT byte-deterministic across runs — 4 identical-config runs of pylint-4604 produced {0, 16, 16, 19} patches. Working pattern: 3-rep n=14 baseline (characterize noise), single-arm n=75 (hypothesis), rep-2 n=75 (ship decision). Caught every false-positive in this cycle.

Cycle artifacts (acceptance/forge-hybrid/, all gitignored): baseline_n14_rep{1,2,3}/, baseline_n75/, tiered_compact/treatment_n{14,75}/, tiered_compact_stress_t040/, tiered_compact_n75_t050/, tiered_compact_stress_n75_t040/, tiered_compact_n75_p50_85_95/, tiered_compact_n75_p50_85_95_rep2/, respond_terminal_b{1,2}_smoke_n14/, trajectory_shape_d_smoke_n14/, protected.json (17 protected instances). Plan executed → see file footer for closeout.

Suggested cold-start sequence: read this entry + lessons.md 2026-05-28 (forge-hybrid closeout entry) + lessons.md 2026-05-26 (substrate noise) + the agents.sdd "forge-hybrid Phase 2 (A) compaction invariants" section. If a SWE-bench / maintain_suite / BFCL run behaves unexpectedly post-cycle, the first thing to try is LUXE_TIERED_COMPACT=0 to bisect whether default-ON compaction is the cause. No follow-up cycle is queued.


⇒ PREVIOUS SESSION HANDOFF (2026-05-28) — Extended-benchmark suite SHIPPED (5 new evals + scaffolding + tests) + 6-bit baseline established; commit local, push pending

TL;DR for a cold start. Added a broad-capability benchmark layer (MMLU / ARC-Challenge / GSM8K / CodeNeedle / Perplexity) alongside the agentic suite (BFCL / SWE-bench / maintain_suite). Captured a clean baseline of the existing agentic benchmarks on Qwen3.6-35B-A3B-6bit. All new code is in this commit; existing HTTP Backend is unchanged so the agentic-suite path carries zero risk. The mlx_lm in-process backend is a sibling path used only by logprob-based evals — necessary because oMLX silently drops logprobs / top_logprobs on both /v1/chat/completions and /v1/completions. Local main has this work staged for commit; push pending user approval. Working tree dirty as planned: new dirs + 3 pre-existing m5 edits to benchmarks/swebench/* + src/luxe/agents/loop.py. lessons.md has an unrelated entry written by the user.

Companion private repo: session artifacts (plan, summary, baseline numbers) live at michaeldtimpe/extended-bench-luxe-research (private). The luxe repo holds the code; the research repo holds the design + session log.


What landed in this commit

5 new benchmarks under benchmarks/ (each: run.py + adapter.py + grade.py):

  • gsm8k/ — 8-shot CoT (Wei et al. canonical exemplars), ####-marker extraction with <think>-block stripping
  • codeneedle/ — vendored upstream extract.py (esprima AST) + scorer.py (SequenceMatcher); manifest frozen at seed=42 with 11 needles in http_server.py + 16 in jquery.js
  • mmlu/ — 5-shot per-subject (Hendrycks protocol), first-token logprob over A/B/C/D via mlx_lm direct
  • arc_challenge/ — 0-shot, variable choice count (3/4/5), first-token logprob via mlx_lm direct
  • perplexity/ — sliding-window over WikiText-103 test, in-process mlx_lm (internal regression metric only, NOT leaderboard-comparable)

Shared scaffolding in benchmarks/_eval_common/:

  • extract.pyextract_gsm8k_answer, extract_choice_letter, strip_think_blocks (think-blocks stripped before all answer extraction)
  • choices.pyformat_mc_prompt handling 3/4/5 options
  • fewshot.pydeterministic_sample + GSM8K_8SHOT_EXEMPLARS (Wei et al. verbatim)
  • logprob.pyplan_sliding_windows (no double-counting; boundary tokens correctly skipped when stride=window) + aggregate
  • meta.pybuild_run_meta collecting eval_suite_version, protocol_version, dataset sha256, model_id, sampling, luxe_commit, timestamp_utc
  • dataset.pycache_dir, sha256_verify, jsonl_load
  • mlx_direct.pyMLXDirectBackend wrapping mlx_lm.load; exposes token_logprobs_from_ids (perplexity) and first_token_top_logprobs + score_choices (MMLU/ARC). Sequencing constraint: do not run while oMLX holds the same weights — ~25 GB doubled

Scripts:

  • fetch_{gsm8k,arc,mmlu,wikitext}_data.py — vendor data to ~/.luxe/<bench>-data/ with sha256 capture
  • build_codeneedle_manifest.py — one-shot manifest freezer
  • run_eval_suite.sh — sequences HTTP-Backend phase (gsm8k, codeneedle) before mlx_direct phase (mmlu, arc, perplexity), with an oMLX-stop prompt between
  • aggregate_eval_suite.py — reads all summaries → markdown

Tests: 102 offline unit tests (pure-function, ~0.1s); tests/test_mlx_direct_smoke.py gated by new live_model pytest marker

Dependencies: new optional-deps group extended-bench = ["esprima>=4.0", "mlx_lm>=0.31", "datasets>=2.0"] in pyproject.toml

Plan file: /Users/michaeltimpe/.claude/plans/dazzling-tickling-bengio.md (also copied into research repo as PLAN.md)

Baseline numbers captured this session (existing agentic suite)

Stored at acceptance/eval_suite_baseline/2026-05-27_6bit/. Wall: ~13h 8min total (BFCL → maintain_suite → swebench-smoke3 chain).

BFCL raw — 1150/1640 (70.12%) on Qwen3.6-35B-A3B-6bit, temp=0:

  • simple_python 400 → 84.25% | multiple 200 → 81.50% | parallel 200 → 65.50% | parallel_multiple 200 → 48.00%
  • irrelevance 240 → 92.08% | multi_turn_long_context 200 → 39.00%

maintain_suite — 10/10 pass, 40/50 score, v1-release-eligible. 2 fixtures lost a point on pr_opened (failing tests at PR-open → draft PR opened instead).

SWE-bench preds-only smoke — 3/3 patches produced. astropy 12907 / 13033 / 13236 (13/14/26-line patches). FAIL_TO_PASS correctness scoring deferred (needs Docker harness against predictions.json).

Dependencies + dataset state

  • ~/.luxe/{gsm8k,arc,mmlu,wikitext}-data/ all populated, row counts match expected, sha256 recorded in fetch logs:
    • gsm8k test 1319 / train 7473 (sha256 3730d312f6e34405, 17f347dc51477c50)
    • arc challenge_test 1172 (sha256 062fe98a0d64b0bb)
    • mmlu test 14042 / dev 285 / 57 subjects (sha256 30225733916644b7, 147bce5b06a81d81)
    • wikitext-103-raw test 1,289,979 chars (sha256 aca2f46735043bcf)
  • .venv updated: esprima 4.0.1, mlx_lm 0.31.3, mlx 0.31.2, sentencepiece 0.2.1, datasets already present

How to continue in a new session

  1. First-time setup (already done on this machine, included here for reproducibility):
    • pip install -e .[extended-bench]
    • python scripts/fetch_{gsm8k,arc,mmlu,wikitext}_data.py
  2. Run the new suite (oMLX must be free for the mlx_direct phase; the runner prompts you to stop it between phases):
    • bash scripts/run_eval_suite.sh --limit 100 for calibration; expect MMLU 65–80%, ARC 75–85%, GSM8K 70–85%
    • bash scripts/run_eval_suite.sh for the full run (slow — multiple hours)
  3. Establish baseline after a verified clean run: copy summary.md to acceptance/eval_suite/baselines/<model_id>_v0.1.0.md and commit
  4. Next iteration: SWE-bench Docker harness on the smoke-3 predictions.json to get real FAIL_TO_PASS pass rates; consider patching oMLX to surface logprobs so MMLU/ARC can share the HTTP path

Open items (none blocking)

  • mlx_lm-direct smoke test (tests/test_mlx_direct_smoke.py) deferred until oMLX freed
  • Hardcoded sha256 values in fetch scripts not yet pinned (captured in this session's logs; pin in a follow-up)
  • Baseline-comparison tooling: first-run baseline is manual; auto-diff/alert is a follow-up

(historical sessions below)

⇒ SESSION HANDOFF (2026-05-26) — Track 0 WASH + edit-quality investigation CLOSED (refined-port REFUTED); diagnostic flag + docs SHIPPED (122831d); NO task in flight

TL;DR for a cold start. Two investigations executed: (1) Track 0 forge-vs-luxe at n=75 → WASH (architecture line retired); (2) edit-quality follow-up that diagnosed luxe's early_bail family as the degrader, ablated it, and tested a refined port → all conclusions banked, no behavior ships. The investigation infrastructure (default-OFF LUXE_EARLY_BAIL_COMMIT_ONLY flag in loop.py + adapter/CLI plumbing + 2 unit tests) and both docs landed as one commit 122831d on 2026-05-26. Local main is ahead of origin/main by 1 (push pending user approval). Working tree clean.


Today's two investigations (chronological)

1. Track 0 forge-vs-luxe loop A/B → WASH at n=75 (the architecture line retires).

  • luxe 30/75 (40.0%) | forge 32/75 (42.67%) → Δ +2 (+2.67pp). Gate #2 ≥5pp FAIL.
  • Paired completion-tokens ratio 1.97× (n=25 paired). Gate #3 ≤1.5× FAIL.
  • Joint = WASH. 0 harness errors, 75/75 valid pairs.
  • The clean Pareto superset at n=14 (forge ⊇ luxe) did not hold at n=75 — at scale it's a +5/−3 trade with 3 luxe-exclusive resolves now existing (django-11333, xarray-3095, sympy-12096). Confirmed the "n=14 can't separate small real edge from favorable draw" caveat.
  • Cost-of-success surprise (median tokens/resolve): forge 4,344 vs luxe 8,574 → forge 0.51×. Aggregate 1.97× comes from forge running full budget on hopeless cases.
  • 5 new forge fragilities at scale: ToolCallError: Retries exhausted (heavy-reasoner malformed emissions) — hidden at n=14.

2. Edit-quality investigation (the durable observation from Track 0):

Phase 1 — Forensic diagnostic (read ~/.luxe/runs/<run_id>/events.jsonl): on the 4 edit-quality differential instances, 100% correlation between luxe-intervention firing and edit-quality degradation. The 3 forge-only wins (django-10880, requests-1724, sphinx-10673) each had 2-3 luxe early_bail family interventions fire (soft_anchor + breadth_probe — all "commit now / narrow / write now" pressure). The 1 forge loss (django-11333) had zero luxe interventions fire — clean luxe trajectory + correct patch.

Phase 2 — Ablation --no-early-bail:

  • n=14: +2 resolves clean, watchdog clean → proceeded to n=75.
  • n=75: +8 resolves (+10.67pp) but watchdog FAILED (4 wrong_target migrations: matplotlib-25775, pylint-6528, sympy-13091, sympy-17318). Per pre-registered band: STOP, non-Pareto repeat of v1.7→v1.11 trade.
  • Cost-of-success at n=75: +10.67pp resolves with only 3 genuine wrong_target damages (historical "10/18" warning did NOT reproduce). 2.2× faster wall (no intervention tokens → cleaner convergence).

Phase 3 — Refined port LUXE_EARLY_BAIL_COMMIT_ONLY=1 (hypothesis: keep commit_imperative at score ≥0.40, suppress soft_anchor + breadth_probe — the high-conv imperative is the protective variant):

  • n=14: +1 resolves AND 1 watchdog hit → STOP per pre-registered band, hypothesis REFUTED.
  • The pivotal instance is matplotlib-20826: baseline empty → --no-early-bail RESOLVED → refined commit_only wrong_target. commit_imperative fired (score climbed to ≥0.40), drove a premature commit to the wrong place. commit_imperative ALSO degrades edit quality — the whole early_bail family pressures premature commits; isolating commit_imperative doesn't help.

What stays vs ships

  • No source change ships. The trade-off documented across these two investigations matches luxe's v1.7→v1.11 tuning history and the 2026-05-24 reflect-cycle HOLD: relaxing premature-commitment pressure trades empty→wrong-action for some empty→resolve. The net is non-Pareto on the wrong-target axis.
  • Edit-quality is a real and now-mechanistically-characterized phenomenon but not portable as a luxe lever via existing interventions. The diagnostic + ablation infrastructure is the durable output.

Shipped state — commit 122831d (one commit, 6 files; behavior byte-identical with baseline)

Documentation (2 files):

  • RESUME.md (this entry).
  • lessons.md (2026-05-26 entries: Track 0 WASH + edit-quality investigation).

Luxe-source diagnostic infrastructure (4 files, default OFF + byte-identical):

  • src/luxe/agents/loop.py (+46/−11): the LUXE_EARLY_BAIL_COMMIT_ONLY env var + breadth_probe/soft_anchor suppression + commit_imperative preservation + a new early_bail_suppressed_commit_only observability event.
  • benchmarks/swebench/run.py (+8): new --early-bail-commit-only CLI flag.
  • benchmarks/swebench/adapter.py (+6): plumb the new parameter through run_instance.
  • tests/test_loop_adaptive_policy.py (+62): 2 new tests (low/mid-conv suppression, high-conv preservation). PASS.

Test suite: 910 tests pass (zero regression from default-OFF byte-identity).

Memory (outside repo, written):

  • project_track0_forge_n75_wash.md (Track 0 result).
  • The edit-quality investigation result is captured in lessons.md + this RESUME entry; a dedicated memory file is not required (no future-recall recommendation arises since the lever was refuted).

Scratch (retained for any future re-use, outside repo):

  • ~/Downloads/forge-luxe-research/ — forge venv, grading venv, all per-instance + grading dirs, comparator JSONs, harness scripts, full NOTES.md briefing.

Suggested cold-start sequence in the new session

  1. Read this RESUME entry + the 2026-05-26 lessons.md entry.
  2. Push 122831d to origin/main if not already pushed (auto-rebase hook will fast-forward; git status to check ahead/behind). The commit is intentionally low-risk (default-OFF flag + docs).
  3. Track 0 + edit-quality lines are now closed. No follow-up is precommitted. Options remain: Track 2 (tiered compaction) was already noted as likely-cut; pick a fresh value axis (BFCL ceiling, new benchmark, model-capability re-bench if a stronger MoE appears — see CLAUDE.md single-champion policy).
  4. Graceful context lifecycle (G1) is now scoped at docs/g1-context-lifecycle-design.md — empirical basis: ~25% of SWE-bench empty_patch failures are EMPTY_PATCH_CONTEXT_EXHAUSTED, stable across 11 versions (see docs/research/e1-context-cliff-report.md). Design only, no implementation cycle queued; the doc lists six candidate levers with tie-in points. Entry point for whichever cycle picks this up next.

⇒ PREVIOUS SESSION HANDOFF (Track 0 WASH only, written overnight before edit-quality investigation) — superseded by the entry above

TL;DR. Track 0 ran to a clean honest WASH at n=75 (the largest test the smoke-then-scale plan called for). The architecture line ("forge's loop beats luxe's") retires for this stack; mechanistic observations preserved. Working-tree changes are uncommitted (this RESUME entry + a lessons.md entry); review the diffs and commit if you want. Per-CLAUDE.md "ask first," nothing was pushed or committed overnight.

Track 0 result (Milestone 2, n=75, co-graded 2026-05-26):

  • luxe 30/75 (40.0%) | forge 32/75 (42.67%) → Δ +2 (+2.67pp). Gate #2 ≥5pp FAIL.
  • Paired completion-tokens ratio 1.97× (n=25 both-have-tokens). Gate #3 ≤1.5× FAIL.
  • Joint verdict: WASH (both gates FAIL). 0 harness errors, 75/75 valid pairs.
  • The clean Pareto superset at n=14 was a small-n favorable draw — at n=75 forge is a +5/−3 trade (3 luxe-exclusive resolves now exist: django-11333, xarray-3095, sympy-12096), not a superset. The language discipline ("n=14 can't separate a small real edge from a favorable draw") was right.

What survives the wash (durable, luxe-portable observations):

  • Give-up-avoidance is a real but two-sided mechanism. matplotlib-13989 reproduces at n=75 (forge converts a luxe give-up) — and 3 other forge wins follow the same shape (matplotlib-24870, psf-requests-1724, django-10880). BUT 2 luxe-exclusive forge losses (xarray-3095, sympy-12096) come from forge running to max_iter where luxe's earlier termination landed the fix → the same trade luxe's v1.7→v1.11 tuned aggressively toward bailing, and that the reflect-cycle closed as HOLD for. Predicted by prior history; confirmed at scale.
  • Edit-quality wins are real but rare: sphinx-10673 reproduces (forge's content correct on the same 2 files where luxe's was buggy); psf-requests-1724 similar.
  • Cost-of-success (median tokens-per-resolved): forge 4,344 vs luxe 8,574 → forge 0.51×. Forge is half the per-success cost when it succeeds; aggregate 1.97× comes from burning the full 30-step budget on unconvertible cases.
  • New scale-only forge fragility: 5 ToolCallError ("Retries exhausted after 3 consecutive failed attempts") against the champion's output shape (heavy-reasoner malformed tool emissions). Hidden at n=14, real at n=75.
  • Forge's respond-terminal discipline actually scaled UP: max_iterations 64% at n=14 → 45% at n=75 (terminal_respond 36/75). Less wall-heavy than the smoke suggested.

Why no port-the-mechanism follow-up was queued: The smoke's clean superset suggested "relax luxe's early-bail." n=75 refutes the clean part — at scale, relaxing early-bail would likely trade give-up→resolve for resolve→empty (the v1.7→v1.11 / reflect-HOLD non-Pareto pattern, now reproduced under forge's loop). The wash is the honest outcome; no luxe lever change is queued.

Working tree (drafts, uncommitted): This RESUME entry + a 2026-05-26 entry in lessons.md. Memory: project_track0_forge_n75_wash.md (written to memory dir, outside repo). Scratch artifacts retained under ~/Downloads/forge-luxe-research/ (full comparator JSON at results/phase2_comparison.json, briefing in NOTES.md, both arms' per-instance + grading dirs).

Next (nothing precommitted): Track 0 architecture line is closed; Track 2 (tiered compaction) was already noted as likely-cut (0 overflows at 131072); pick a fresh value axis. This warrants a new conversation. Plan: ~/.claude/plans/binary-gathering-panda.md (executed). Executed plans (now DONE): ~/.claude/plans/noble-squishing-kahn.md, ~/.claude/plans/velvety-purring-forest.md, ~/.claude/plans/binary-gathering-panda.md.


⇒ PREVIOUS SESSION HANDOFF (2026-05-25) — all work landed; tree clean, in sync; NO task in flight

Repo state: main is linear + in sync with origin/main (HEAD 4327593); working tree clean; full suite 978 pass. Nothing is in progress — this is a clean cold start.

What closed (all committed + pushed):

  • Reflect/verify cycle: CLOSED. Phase 2 repair = HOLD (LUXE_REFLECT stays opt-in). The borderline give-up label spot-check is DONE — 13/14 upheld, 1 non-gate-moving flip (miss_param_159); gate UNCHANGED (miss_func detection 81.8%, false_gap 16.7%, PASS). Detail in the ACTIVE section below + [[project_reflect_cycle_phase1]].
  • WS2 "acted-but-wrong-binding" sizing → BANK (no lever). State-checker-decisive wrong-bindings = only 21/151 (13.9%), below the escalation bar; dominated by exact-free-text-content (the content ceiling), recipient_id 0-decisive. Tools: scripts/analyze_acted_but_wrong.py + scripts/verify_wrong_binding_attribution.py (+ tests/test_wrong_binding_sizing.py). [[project_acted_but_wrong_sizing]].
  • Git workflow hardened. origin/main enforces linear history (no merge commits / no force-push; admin-bypass only). A committed PreToolUse hook (.claude/hooks/precommit-pull.sh, wired in .claude/settings.json) + repo-local pull.rebase/rebase.autoStash auto-rebase before every commit. Rebase, never merge. [[feedback_git_linear_history]].

Next (nothing precommitted): the cycle is closed; the residual multi_turn failure mass is a reasoning/obligation + benchmark-rigidity ceiling, not a new addressable axis. Untouched options remain (Track 0 forge-vs-luxe loop A/B; Track 2 tiered compaction) or pick a fresh value axis — this warrants a new conversation, not a continuation. Read CLAUDE.md + this file + lessons.md (2026-05-25 entries) first. Executed plans (now DONE): ~/.claude/plans/noble-squishing-kahn.md, ~/.claude/plans/velvety-purring-forest.md.


⇒ ACTIVE (2026-05-24): reflection/verify cycle — Phase 2 repair = HOLD (miss_func +6 net, but non-Pareto + kill-warning); CYCLE CLOSED

First feature-adding cycle since the multi_turn sweep closed. Goal = move benchmark scores; all invariants firm (single-champion, mono-only, temp=0/reproducibility, vertical+oMLX-only). Two external research reports (forge, Hermes) were mapped against the code + .sdd and mostly cut (invariant-conflicts / out-of-scope / already-done). The one novel+compatible idea: a same-model verify/reflect pass targeting the residual "right file, wrong/no change" + premature give-up mass. Plan (approved): ~/.claude/plans/glistening-squishing-alpaca.md.

Cycle shape (locked pre-registrations):

  • Track 1 (main): reflect pass — verify-only first, then repair. Gate: per-axis detection (miss_func ≥40%, miss_param moot) AND false-gap ≤20%; fire policy ≤5%→always-on / 5–20%→gated /

    20%→kill. Repair budget: 1 re-prompt/turn (mt), 1 loop re-open ≤3 steps (swe), no new tools, hard stop. Verifier is critique-only / functional-blocker-only / benchmark-generic (anti-overfit).

  • Track 0 (parallel, NOT STARTED): forge-vs-luxe loop A/B — scratch ~/Downloads/forge-luxe-research/, 48h timebox, decisive-win = ≥3 more resolves n=14 AND ≥5pp n=75 AND portability (≤1.5× inflation). Gates the SWE-bench half only (multi_turn doesn't use run_agent's loop).
  • Track 2 (conditional, NOT STARTED): tiered compaction — auto-cut unless long_context elision fires + drops needed context / attention-dilution shown. (Evidence so far: 0 overflows at 131072 → likely cut.)
  • CUT: Anthropic eval-judge.

WHAT'S DONE (this session, all on main; tree committed):

  • src/luxe/agents/reflect.py — the verify primitive: whole-conversation multi_turn framing (robust to message-less reveal turns; abstains on alt-completions), critique-only prompts, gap derived from ≥1 substantiated functional-blocker deficiency, last-JSON parser + high token budget (the champion is a heavy reasoner — see lessons), response_format json nudge.
  • src/luxe/backend.py — minimal response_format param (disable-equivalent).
  • src/luxe/agents/agents.sdd — reflect/verify surface contract (opt-in LUXE_REFLECT, disable-equiv, verify-only non-perturbing, functional-blocker gate, no benchmark-semantic prompts).
  • tests/test_reflect.py + tests/test_prompts.py reflect tests; full suite 955 pass.
  • scripts/analyze_empty_turn_convertible.py (Phase 0), scripts/dump_empty_turn_for_labeling.py (labeling dump), scripts/measure_reflect_phase1.py (label-grounded gate, per-pid verdicts saved).

PHASE 0 grounding (honest, supersedes the Plan-agent's "41 convertible"): hand-labeled all 58 empty_turn failures (the over_acted structural heuristic is unreliable). miss_func: ~22 unmet (repairable give-ups) / 7 met; miss_param: ~4 unmet / ~25 met. miss_param empty_turns are MOSTLY the model competently resolving the ambiguous param and completing the task — the state-checker fails it on a turn-path technicality (NOT give-ups, NOT repairable). Dominant miss_func mode: model claims a withheld-then-revealed tool "isn't available" and gives up ("tool-unavailable anchoring"). Labels: acceptance/bfcl/reflect_phase0/giveup_labels.json (gitignored — on-disk only, same-machine; ~9 borderline flags PENDING USER SPOT-CHECK; recreate via the dump script + re-label if lost).

PHASE 1 verify-only gate = PASS (acceptance/bfcl/reflect_phase1/verify_only_result.json, 118 calls, 0 errors): miss_func detection 81.8% (18/22), false-gap 16.7% (10/60)fire policy = gated-only. Headline: same-model temp=0 self-verification CAN separate give-ups from correct work here (the catch-22 didn't bite) — a real positive result. Nuance: the 10 false-gaps are mostly verifier-vs-state-checker divergence (the verifier flags "confirm/convey/report" sub-asks the state-checker ignores), not pedantry — so the pass sample isn't fully flawless w.r.t. the user's ask. Detection misses (4): miss_func_33 (wrong recipient), _142 (partial), _122 (malformed→fooled), _93.

PHASE 2 repair — BUILT + COMMITTED (7c621c8), default-off byte-identical; full A/B DONE → HOLD. The gated verify→repair stage is wired into run_problem_multi_turn (opt-in LUXE_REFLECT):

  • Two-gate fireadapter._is_giveup_turn (a ZERO-call turn = the empty_turn give-up signature) gates the expensive verify call AND, by construction, skips the verifier's reporting-gap false-gaps (they have non-empty action sets); verify must THEN return gap=true.
  • Budget (locked): ONE _luxe_repair-marked corrective nudge (reflect.repair_nudge, generic, consumes the verdict's deficiencies, no benchmark semantics) + one bounded re-prompt over the SAME exposed tool surface, appended to the same turn (decoded_turns[-1]), hard stop (no re-verify, no loop). repair_turns records where it fired. agents.sdd Phase 2 invariants added.
  • scripts/ab_multi_turn_miss.py is the ship-gate instrument; +10 tests; full suite 965 pass.
  • Byte-identity validated on REAL problems: refactored off-path reproduces m5_rep_1 exactly for miss_func_7/9/15 (empty_turn, 0 fire). Host = M5 Max (Mac17,6, 128 GB) = the m5_rep_1 baseline host → temp=0 determinism means m5_rep_1 is a valid clean arm (only the reflect arm is being run).

⇒ PHASE 2 A/B VERDICT = HOLD (keep LUXE_REFLECT opt-in; ship gate FAILED on 2 of 3 criteria). Reflect arm acceptance/bfcl/multi_turn_miss_func/reflect_arm/ (200/200, 0 errors, 9623s); clean = m5_rep_1 (reused, M5/temp-0 valid). scripts/ab_multi_turn_miss.py:

metric result
overall pass clean 50.0% (100) → reflect 53.0% (106), Δ +6
flips 8 fail→pass / 2 pass→fail (net +6)
repair fired 66/200
no-op leaks (non-fired ≠ clean) 0 ✓ (repair is a clean no-op off-target)
empty_turn→mismatch migrations 16 ✗ (the HARD kill-warning)

The +6 score is real but it FAILS the pre-registered Pareto+safety gate: 2 pass→fail regressions AND 16 give-ups turned into WRONG actions. Repair makes the model act-when-uncertain, and ungrounded it is wrong far more than right: 8 fixes vs 18 made-worse (16 migrations + 2 regressions). Both score regressions are Phase-1 false-gaps (16.7%) materializing as damage — verify false-flagged a deliberately-empty turn in a passing problem, and the "don't stop until it's done" nudge induced over-action/runaway: _112 spiraled into 40+ get_symbol_by_name calls (hit the 50-call cap, never advanced → empty_turn); _184 over-booked on turn 0 → instance_state_mismatch. The 8 genuine fixes are real give-up→complete conversions (_7/_39/_94/_100/_146/_154/_164/_194). Smoke→full consistency was exact (_7 fix, _9/_15 migrate). Negative datapoint BANKED (the plan treats it as first-class).

Why HOLD despite +3pp: the gain is bought by encouraging a behavior (ungrounded action) that won't generalize and is less safe than a give-up (an empty turn is a safe failure; a wrong action is not). Same shape as the Part-A GFS-guidance non-Pareto wash — a score nudge with deterministic regressions stays opt-in. LUXE_REFLECT remains off by default; the stage + A/B harness stay in-tree, documented.

⇒ NEXT (cycle effectively closed; options, none precommitted):

  1. Optional refinement (would need a re-bench): the 2 regressions are budget/false-gap-driven — a much tighter repair budget (≤2–3 steps, not the full 15) would likely kill the _112 runaway and may recover both regressions; tightening the give-up gate to skip turns the clean model deliberately left empty would cut false-fires. Neither touches the 16 migrations — those are the core limit (self-repair without fresh grounding acts wrong), so even a Pareto-clean refinement is a small win.
  2. Borderline label spot-check (14 non-clear labels) — low-value now (gate cleared comfortably; the action moved to repair quality). On-disk only.
  3. Untouched cycle tracks: Track 0 (forge-vs-luxe loop A/B), Track 2 (tiered compaction). Or pick a new value axis (multi_turn sweep is closed; SWE-bench loop near ceiling per prior grounding).

⇒ Phase 2 follow-ups SHIPPED (8fadd92; plan ~/.claude/plans/noble-squishing-kahn.md) — hygiene/closure, NO re-bench, HOLD stands, no ship-status change. (1) tight repair-budget cap _REPAIR_MAX_STEPS=4 in benchmarks/bfcl/adapter.py (artifact-scoped: covers every observed Phase-2 repair, bounds the _112 runaway) + agents.sdd + cap test (test_repair_respects_tight_step_cap; full suite 966). (2) borderline-label tooling: dump_empty_turn_for_labeling --only-borderline (prints the 14 pending labels + saved verdicts side-by-side) and measure_reflect_phase1 --from-verdicts (offline gate recompute from the frozen per-pid verdicts — no oMLX; reproduces 0.818/0.167/true bit-exactly). (NB: Phase 1 saved gap/ok/specificity, not the deficiency free-text, so the dump shows specificity tags.)

⇒ Borderline spot-check DONE (2026-05-25; plan ~/.claude/plans/velvety-purring-forest.md). User reviewed all 14 borderline give-up labels; outcome encoded as reviewed_label/review_note in giveup_labels.json (originals preserved; rationale archived at acceptance/bfcl/reflect_phase0/borderline_review.md). 13/14 upheld, 1 flip (miss_param_159 met→unmet — wrong insurance cost 50 vs 500; verify had correctly flagged it). Recompute (measure_reflect_phase1 --from-verdicts): miss_func detection 81.8% (18/22) and false_gap 16.7% UNCHANGED, GATE PASS; only the (un-gated) miss_param detection moved 3/4→4/5. The human review validates the detection figure rather than changing it — Item 2 fully closed.

⇒ WS2 DONE = BANK (2026-05-25): "acted-but-wrong-binding" axis sized, NO lever. Read-only scripts/analyze_acted_but_wrong.py (+ tests/test_wrong_binding_sizing.py, 11 tests; full suite 977) diffed model vs GT calls over the never-examined acted-but-wrong mass (instance_state/execution_response failures, A=151 = 71 miss_func + 80 miss_param; disjoint from the 58 give-ups). Buckets: gt_value_mismatch 58 (38.4%), omission 60, extra_action 33, path_divergence 0. A counterfactual deep-dive replaced the eyeball skim (scripts/verify_wrong_binding_attribution.py): substitute GT value(s) back + re-run the vendored state checker (sanity-gated: reproduces 58/58 stored verdicts) → a fail→PASS flip = DECISIVE binding. DECISIVE wrong-binding = 21/151 = 13.9% (by subtype: string_format 17, numeric 7, recipient_id 0). Two corrections to the first writeup: string_format is NOT mostly benign (17 decisive) but those are almost all exact-free-text-content matches (reproduce the author's exact tweet/message/ticket wording — the content ceiling, not a binding); and recipient_id is 0-decisive (the human review's "wrong recipient" headline is never the sole cause in the acted set). Pre-registered gate → BANK: 21 < 30 and 13.9% < 20% (below the size bar) + no dominant separable addressable cluster + the only fix is washed-out exact-content enforcement. Taxonomy is the deliverable: the mass is mostly omission (the obligation/final-step ceiling, same family as the give-up HOLD) + GT/content rigidity — not a new addressable axis; the 50.0%/45.5% baselines are partly depressed by benchmark exactness. Manifest acceptance/bfcl/wrong_binding/sizing_manifest.json (gitignored); lessons.md + memory project_acted_but_wrong_sizing.md. **Borderline-review plan fully executed

  • deep-dive-confirmed.**

Reproduce: Phase 0 .venv/bin/python -m scripts.analyze_empty_turn_convertible; relabel dump .venv/bin/python -m scripts.dump_empty_turn_for_labeling; Phase 1 .venv/bin/python -m scripts.measure_reflect_phase1 (needs oMLX up; ~1hr; --smoke N for a quick check). oMLX on localhost:8000, champion loaded as Qwen3.6-35B-A3B-6bit (lowercase alias resolves). .venv/bin/python on this host. CAVEAT: the verifier needs the high token budget + last-JSON parser — json-mode / /no_think / prefill do NOT suppress this champion's reasoning (lessons.md 2026-05-24).

✅ DONE (2026-05-23, M5 Max 128 GB): BFCL multi_turn SWEEP COMPLETE — all 4 categories baselined

The multi_turn category sweep is closed. All four categories now have a faithful champion (Qwen3.6-35B-A3B-6bit, temp=0) baseline. Difficulty order is as expected (adversarial categories hardest):

category baseline host num_ctx artifacts
multi_turn_base 63.5% (127/200) M5 32768 acceptance/bfcl/multi_turn_base/m5_faithful_rep_1/ (faithful; supersedes M1 63.0%/126)
multi_turn_long_context 57.5% (115/200) M5 131072 acceptance/bfcl/multi_turn_long_context/m5_faithful_rep_1/
multi_turn_miss_func 50.0% (100/200) M5 32768 acceptance/bfcl/multi_turn_miss_func/m5_rep_1/
multi_turn_miss_param 45.5% (91/200) M5 32768 acceptance/bfcl/multi_turn_miss_param/m5_rep_1/

All M5 runs: 0 overflows, 0 errors, 200/200 graded. base (M5 faithful): 127/200, wall 4056s, avg 20.3s, 47 instance_state_mismatch / 17 execution_response_mismatch / 9 empty_turn. miss_func: wall 6650s, avg 33.2s, failure modes 55 instance_state_mismatch / 29 empty_turn / 16 execution_response_mismatch. miss_param: wall 5287s, avg 26.4s, 65 instance_state_mismatch / 29 empty_turn / 15 execution_response_mismatch. The miss_ categories show ~4× the empty_turn rate of base/long_context (29 vs ~8)* — the model gives up more often when a needed tool/param is withheld (the intended challenge). See [[project_bfcl_multi_turn_miss_baselines]] and [[project_bfcl_multi_turn_long_context_baseline]].

Mechanic (shipped 4b5d462): run_problem_multi_turn now derives the exposed tool surface PER TURN from two problem fields — missed_function {turn:[names]} (held out until that turn, exposed from it onward; strict > so a fn keyed k is available AT turn k, matching upstream base_handler) and excluded_function (hidden the whole conversation). tool_fns stays complete (only exposed DOCS are filtered — faithful to BFCL decode-and-execute); the vendored state-based checker is unchanged. Routing, grading, and data-loading were already category-agnostic (the checker self-derives test_category from the id). Per-turn exposed_tool_names is recorded (the only record of the withholding schedule, since the grader is exposure-agnostic). Validation: audit 200/200 in-scope each category; reveal semantics proven (off-by-one >= would spike GT-unreachable from 1→206); GT-as-pred 200/200 both; +12 tests; full suite 940 pass. No parity oracle on M5 (bfcl_eval absent) — relied on the M1-parity-verified category-agnostic grader + generation-side validation (the parity blind spot the plan flagged).

excluded_function faithfulness fix (CAVEAT for base + old long_context). The pre-4b5d462 driver IGNORED excluded_function, so 18/200 problems in EVERY category had cp/mv/rm wrongly exposed (upstream — and the GT — exclude them; base GT never calls them). The fix applies it uniformly. Impact, fully characterized: long_context dropped 58.5%→57.5% — exactly 2 deterministic flips (_1, _40, both True→False), both within the 18, 0 flips outside (determinism confirmed). So the old M5 long_context 58.5% (m5_rep_1/) was inflated by 2; the faithful number is 57.5% (m5_faithful_rep_1/). multi_turn_base was then re-measured faithfully on M5 = 63.5% (127/200) (m5_faithful_rep_1/), superseding the M1 63.0% (126/200). The +1 cannot be cleanly attributed to the fix alone — no M5-unfaithful base run exists to diff against, so it conflates the excluded_function fix with any M1→M5 substrate difference — but it confirms the impact is tiny (consistent with long_context's 2-problem move). Lesson recorded in lessons.md 2026-05-23.

Reproduce (resume-safe; .venv/bin/python on this host — bare python3 is Homebrew 3.14 w/o luxe):

# miss_func / miss_param (NOT long_context → 32768 fine):
.venv/bin/python -m benchmarks.bfcl.run --categories multi_turn_miss_func multi_turn_miss_param \
  --num-ctx 32768 --temperature 0 --model qwen3.6-35b-a3b-6bit --base-url http://127.0.0.1:8000 \
  --output <dir>/   # NB: one --output → <dir>/<category>/ subdirs; use per-category dirs for convention
# long_context faithful (needs the big window):
.venv/bin/python -m benchmarks.bfcl.run --categories multi_turn_long_context --num-ctx 131072 ...

Setup on a fresh M5 clone: pip install -e ".[dev,bfcl]" (incl. mpmath; do NOT install bfcl_eval — breaks src/luxe/symbols.py). bash scripts/fetch_bfcl_data.sh (now fetches all 4 multi_turn categories + a blocking GT pre-flight).

⇒ NEXT SESSION: multi_turn sweep is CLOSED (all of BFCL v4) — pick the next axis

BFCL v4 has exactly 4 multi_turn categories (verified 2026-05-23 against the upstream data dir: base, long_context, miss_func, miss_param) — all now baselined. There is no composite data file in v4 (it was a v3 category that v4 dropped; the vendored checker still references "composite" at multi_turn_checker.py:49/66 because it was copied from an older bfcl_eval, but no v4 dataset exists to run it). So the multi_turn capability axis is genuinely 100% complete — there is no remaining standalone multi_turn category to baseline. This is a clean point for a fresh user conversation on the next value axis (the post-v1.11 grounding concluded the SWE-bench loop/prompt levers are near their ceiling — see "What to do next session" further below). Part A (scoped GorillaFileSystem guidance) remains a non-Pareto wash, kept clean (opt-in LUXE_MT_CLASS_GUIDANCE, default off). Full multi_turn detail below.

Champion: Qwen3.6-35B-A3B-6bit (single, platform-stable, daily driver on M1 + M5)

luxe pins one MoE model in configs/single_64gb.yaml, and all ongoing development is centered on making that model better. The M5 Max m5max_moe bake-off (2026-05-10) confirmed it: 10/10 perfect, fastest wall, highest TPS, no bailouts — beat the two larger MoE candidates (Qwen3-Coder-Next-80B, GLM-4.5-Air-106B) on the same gate. The champion is the same on M1 Max (64 GB) and M5 Max (128 GB); no platform split. The bake-off is closed. If a re-bench is ever needed, see ~/Downloads/luxe/CLAUDE.md §"Single-champion policy" for the structure to follow.

Closed 2026-05-12: the M5 daily-driver shootout vs deluxe. luxe ran the same 10 maintain_suite fixtures on the M5 host against deluxe's strongest dense candidate (Qwen2.5-72B-Instruct-4bit-AWQ). Result: luxe 10/10 verified vs deluxe 4/10, 6.4× faster wall (41s vs 263s per fixture), 7.3× faster TPS (71.4 vs 9.8), ~11 GB less RAM. luxe is now the daily driver on both platforms. The shootout reference run is at acceptance/m5_shootout/ for future archaeology. The deluxe dense candidate set is exhausted; no further shootouts are queued.

Host lane assignment (closed 2026-05-12)

luxe is the daily driver on both M1 Max and M5 Max (Apple Silicon, 64 GB / 128 GB respectively) for maintain_suite, SWE-bench, and day-to-day agentic work. The deluxe dense-fork's M1 lane was paused 2026-05-11 (R1 BFCL champion Qwen2.5-32B-4bit and coder-tuned retry both rejected; dense 32B-class structurally exceeds M1 Max effective hardware capacity for maintain_suite gates) and the deluxe M5 lane was closed 2026-05-12 after the shootout. See ~/Downloads/deluxe/RESUME.md for the full closure record + Tier 1/2/3 open paths; lessons.md 2026-05-11 dense.M1 entry for the M1 cross-repo postmortem; ~/Downloads/deluxe/lessons.md 2026-05-12 entry for the M5 behavioral-ceiling diagnosis.

M5 (Apple M5 Max) was the MoE bake-off / substrate-validation lane in May (last closed: m5max_moe 2026-05-10, 30/30 across three MoE candidates) and is now the production lane alongside M1. This document tracks the luxe production state across both hosts.

Current state — 2026-05-22 (multi_turn BFCL — CYCLE COMPLETE; champion baseline 63.0%)

Shipped the deferred BFCL multi_turn category (stateful tool orchestration — a new capability axis, the chosen "next thing" after the post-v1.11 tracks closed). All 4 phases done + pushed to origin/main (433e8ac). Plan/phase detail: ~/.claude/plans/serialized-noodling-reef.md.

HEADLINE — champion (Qwen3.6-35B-A3B-6bit) multi_turn_base baseline: 126/200 = 63.0%, SHIP-GRADE (exactly deterministic over 3 reps). 3-rep variance (scripts/variance_multi_turn.py): rep_1=rep_2=rep_3=63.0%, spread 0.0%, 0 flips across 200×3 (126 stable-pass + 74 stable-fail) — multi_turn at temp=0 is fully deterministic on this substrate (stronger than SWE-bench's ±2 noise). Clean backend.chat generation, faithful vendored state-based grading; 0 errors, ~3.5h/rep, acceptance/bfcl/multi_turn_base/rep_{1,2,3}/. Failure modes: 49 instance_state_mismatch, 18 execution_response_mismatch, 7 empty_turn. Deep-dive (per-class): GorillaFileSystem 42% (hardest) → TradingBot 88% (easiest); turn-depth-independent; over-calling correlates with failure; no guardrail distortion. This is "luxe clean multi_turn" (no interventions; grader leaderboard-faithful, parity- verified). Reproduce: python -m benchmarks.bfcl.run --categories multi_turn_base --output <dir>. No ~/.luxe/runs manifest (backend.chat driver, no run dirs); per-problem JSONs + summary.json are the record.

  • Phase 0 (audit gate) — PASS: clean-subset (8 deterministic stdlib/numpy involved classes) covers 200/200 base problems; grading is plain == on stdlib attrs (no normalization) → faithful by verbatim vendoring; official checker runs locally as a parity oracle.
  • Phase 1 (vendor) — DONE (be45868): official tree_sitter-free state-based eval vendored to benchmarks/bfcl/multi_turn/ (8 classes + checker + utils + config + func-docs); 200/200 GT-as-pred PASS in MyEnv; corrupt→FAIL; oracle matches. bfcl_eval NOT in runtime (MyEnv); stale repo .venv is the read-only vendor source / parity oracle only.
  • Phase 2 (driver + grader) — DONE (2ee167e): clean backend.chat loop (run_problem_multi_turn) — NOT run_agent (no per-turn history seeding + interventions would contaminate); live persistent instances during generation, vendored checker re-executes on fresh instances during grading (faithful by construction). executor.py (serializer + fail-soft executor
    • tool surface), grade_multi_turn, run.py routing + transcript retention. 7 tests + full suite 928 pass; real-model n=2 smoke ran end-to-end (model used pwd/ls to navigate, state-graded, 0 errors). The replay-idempotence test caught + fixed a globals() instance-cache leak.
  • Phase 3 (parity) — DONE (433e8ac): scripts/parity_multi_turn.py re-grades the same decoded_turns with the official bfcl_eval scorer (stale .venv); n=25 parity = 25/25 match, 0 mismatch (vendored == official, env-equivalent). Surfaced + fixed a JSON-serialize crash (raw GorillaFileSystem Directory objects in the checker state-diff → default=str + write-fallback; n=2 smoke missed it, n=25 caught it). Prompt gate PASS: 25/25 emit tool calls (mean 8.8/problem), no prose-collapse.
  • Phase 4 (baseline) — DONE: n=200 → 63.0% (above; the n=25 sample's 40% was harder-than-average).

Multi_turn cycle CLOSED (base 63.0% ship-grade). Follow-on work (2026-05-22/23):

  • Part A — improve GorillaFileSystem (scoped guidance): NON-PARETO WASH, kept clean. Opt-in LUXE_MT_CLASS_GUIDANCE=1 (default off, byte-identical, scoped to GFS-involved problems) appends file-system precision guidance. Exact 0-variance A/B (clean rep_1 vs enhanced_rep_1): overall 63.0%→63.5%, GFS 42%→44%, net +1 (4 fixed: base_11/13/15/38, 3 broke: base_6/33/35), non-GFS 150/150 byte-identical. The 3 regressions are under-action (base_6 writes 5→2; base_33 calls 6→5) — the precision guidance trades over-action failures for under-action failures (classic non-Pareto; it DID cut GFS over-calling 8.3→7.8). Marginal/wash → clean stays the default; the guidance mechanism stays opt-in + documented (not a win). Another prompt-lever-washout datapoint, but with EXACT attribution (the 0-variance gift). scripts/ab_multi_turn.py.
  • Part B — long_context: baseline = 39.0% (78/200) at num_ctx=32768, but CONTEXT-LIMITED — not a clean capability number. Generation fix shipped (build_tool_surface forwards long_context= to _load_scenario; extension fires, GFS tree 466→12054). 43/200 (21.5%) FAIL on context-overflow — oMLX 400 "Prompt too long: ~35K exceeds 32768" as the big extension tool-results accumulate (the category is DESIGNED to exceed 32K). So 39% UNDERSTATES long_context capability. Grader robustness fix shipped (grade_multi_turn pads truncated trajectories to GT length → graded as fail, not a checker IndexError; +test). A proper long_context baseline needs num_ctx > 32768 — but that approaches the 36GB GPU-wired cap ([[project_omlx_host_capacity]]); raising it risks an MLX OOM/crash on the shared oMLX. NOT auto-launched overnight — IN-THE-LOOP CALL (suggest trying num_ctx≈49152, watch memory). Grader self-derives test_category from the problem id, so generation+grading are consistent (verified).
  • miss_func/miss_param DEFERRED — need dynamic per-turn tool-withholding (missed_function held out then re-added at its turn; excluded_function removed) whose generation-side correctness parity can't validate; mechanics documented (base_handler.py:108/176, utils.py:788). Implement carefully (not rushed).

Earlier state — 2026-05-22 (Track C grounding REFUTED its premise; Track D CLOSED — BFCL "irrelevance-only" was stale; full suite runs on current substrate)

Two roadmap tracks resolved by cheap grounding this session — neither needed a build.

Track C (above-loop signaling) — premise REFUTED before any code. The thesis was that task-semantics / traceback-locality signals (knowable above the loop) could fix what the loop layer can't. Grounding against the n=75 baseline taxonomy + verified.jsonl killed it: locus discovery is already solved — the model touches a gold target file early (≤4 steps) in 73/75 runs across every tier (wrong_target 14/17 early, empty 9/13 early; only 2/75 never touch a gold file). And tracebacks are rare and anti-correlated with success (9/75 issues; 7 of those 9 are wrong_target; gold file named in 25/75, more common in wrong_target 9/17 than strong 7/20). So the failures are not "couldn't find the file" — they're "found the file, produced the wrong/no change": a reasoning/content ceiling, not a locality one. Surfacing file locations can't help discovery that already works ~96% of the time. See lessons.md 2026-05-22; [[project_trackc_locus_grounding]].

Track D (BFCL substrate hygiene) — CLOSED as record-correction, not an unblock. RESUME framed it as "revert the bfcl_eval substrate so the full suite runs (irrelevance-only)." Both halves were stale. luxe's BFCL grader (benchmarks/bfcl/grade.py) is pure-Python (function-name + arg-allowed-set; 5 categories) and never imports bfcl_eval; the data is vendored (~/.luxe/bfcl-data/, commit dfdb0c8). The tree_sitter==0.21.3 conflict only ever affected data access via an old import bfcl_eval fallback, eliminated by vendoring. Smoke (2026-05-22, raw, 5/category) confirms the current substrate supports end-to-end execution + grading across all 5 categories: 20/25, nonzero passes in every non-irrelevance category (simple 4/5, multiple 5/5, parallel 4/5, parallel_multiple 2/5, irrelevance 5/5), no tree_sitter/bfcl_eval traceback. Fix shipped: removed the dead import bfcl_eval fallback in adapter.py + corrected docstrings/error to warn against installing bfcl_eval. Measurement debt now CLOSED: full-suite re-baseline ran 2026-05-22 on the current substrate (agent, n=1240, ~9h) = 90.24% total, byte-identical to v1.8 every category (simple 90.25 / multiple 88.50 / parallel 87.50 / parallel_multiple 83.00 / irrelevance 100), 0 errors — confirms zero regression across the swap + 5 releases + v1.11. Current-substrate reference = acceptance/bfcl/post_v1105_full_n1235_agent/rep_1/. See [[project_bfcl_full_suite_unblocked]].

Working tree: clean. 921 tests pass + 19/19 bfcl adapter. Commits this arc: 7991293 (v1.11.1 analyzer), b25e0d0 (Track D), + this handoff — all pushed to origin/main. No tag (v1.11.1 was a STOP; Track D was hygiene/record-correction — neither is a behavior ship; main runtime ≈ v1.10.5).

Earlier state — 2026-05-21 (v1.11.1 offline gate-design CLOSED — STOP at Gate A′; loop-layer-predicate line EXHAUSTED; main unchanged)

v1.11.1 = candidate B′ (predicate redesign of the v1.11 lever), run OFFLINE-ONLY. Outcome: Phase A′ decision gate returned STOP — no loop-layer predicate separates recovery from stall — so no code was wired and no bench was spent. main is unchanged from the v1.11 close (≈ v1.10.5 + calibrated observability). The v1.11.x adaptive loop-layer-predicate line is exhausted; next work should pivot to a different signal space (Track C) or housekeeping (Track D). See [[project_v1111_gate_design_stop]] and lessons.md 2026-05-21 v1.11.1 entry.

Working tree: analyzer + docs uncommitted (scripts/analyze_v1111_gate_design.py, lessons.md, RESUME.md, memory). 921 tests pass (no src/ change). NOT pushed, NOT tagged.

What v1.11.1 did: forked analyze_v111_calibration.pyscripts/analyze_v1111_gate_design.py. Mined the v1.10.5 BASELINE arm (post_specdd_v1105_n75, 225 retained event streams — uncontaminated, NOT lever-ON). Reconstructed two candidate gates per wall step: C1 temporal-persistence (consecutive trend≤0 via the production score_trajectory_trend over the convergence_score series; resets on positive trend) and C2 breadth-saturation (steps since a new successfully-touched distinct file path). Joined to baseline tiers (v1105_taxonomy); classes re-derived from n=75. Cross-validation: offline single-step reconstruction reproduced 45/45 actual lever-ON soft_anchor_collapse_promote_fired events.

The STOP result: band universe 92/225 (recovery 33, stall 59). The v1.11 single-step gate fired on 30/33 recovery (quantifies the v1.11 failure). No predicate clears "0 recovery false-positives" with useful stall coverage: strict-C1 K=5 → 0 recovery but only 1 stall (useless); C2 J=4 sheds xarray-3305 but still fires on pylint-4661 + 5 recovery; min_step sweep to 12 never clears; C1∧C2 conjunction at min_step=8 still fires on 6 recovery. Root cause: recovery and stall are structurally entangled in the score<LOW band — pylint-4661 sits at conv=0.0 with saturated breadth for steps 6–9 (indistinguishable from a stall) and commits a plausible patch only at step 13. "Late successful committer" vs "stall" is a reasoning/commit-timing property, not loop-observable. A predicate-only redesign cannot rescue a non-Pareto lever when target and protected classes are entangled in the signal it reads.

Reproduce: python -m scripts.analyze_v1111_gate_design (read-only; manifest → acceptance/v1111_gate_design/run_id_manifest.json).

Earlier state — 2026-05-21 (v1.11 cycle CLOSED — lever tried + reverted; main ≈ v1.10.5 + calibrated observability; NOT tagged)

v1.11 = candidate B (per-instance adaptive policy). Outcome: the activation lever was net-negative at n=75 and was reverted. No v1.11 tag. main sits at v1.10.5 behavior plus the Phase A calibration finding (no_write retirement, v1.10.5-neutral) and observability. The v1.11.1 follow-on (above) closed the open design target: it's not solvable at the loop layer.

Working tree: clean. 921 tests pass, 0 skip. Commits on main past 924af08: d50b84f (Phase A analyzer), b026295 (Phase B lever), 8a75ebe (Phase C scaffold + smoke fix), f60eb5e (RESUME), b5d71f4 (Phase B REVERT). NOT pushed, NOT tagged.

Phase A — calibration (scripts/analyze_v111_calibration.py, [[project_v111_phaseA_calibration]]): over the 71 retained Phase 3a/4 event streams. Two cross-check corrections: substrate was NOT inert (write_pressure mod departed 1.0 in 231 events); diff_stat over-counts patches (use patch_present). Decisive finding: consecutive_no_write is non-selective (precision ≤31% — read-heavy successes hit the same depths as stalls); score_trend (collapse velocity) separates empties at fire-time (step 6–8). Retired the no_write→write_pressure bias (kept).

Phase B → C → D — activation, tried and REVERTED:

  • Lever: score_trend → soft_anchor score<LOW band-response collapse promotion (breadth_probe → soft_anchor nudge) gated on conv<LOW AND step≥6 AND trend≤0.
  • Phase C (archetype-6 + 2 empties ×3 + BFCL): all gates passed — lever fires, 0 archetype regressions, BFCL 240/240. But 0 lever-attributable conversions even in the probe (seaborn-3069 took 4 nudges, didn't budge). Smoke caught + fixed a _COLLAPSE_MIN_STEP 7→6 error first.
  • Phase D (n=75 ×3 + Docker + cohort_shift_3x3): HOLD. cohort_shift = 3 deterministic losses, 0 gains. 2 are lever-caused (promotion fired 3/3): xarray-3305 strong→plausible, pylint-4661 plausible→wrong_target — premature-commitment tier demotion. 1 (pylint-4604 wrong_target→empty, 0 promotions) is substrate drift. Aggregate: empty 13→16, s+p 39→37 (both floors missed); Docker ~wash (the 3 tier losses cost 0 Docker resolves).
  • Reverted (b5d71f4): loop.py promotion + constant + flag. Kept: no_write retirement, convergence_score field, the score_trend→soft_anchor bias (observability only — shows where a future stall signal would fire).

Methodology note: the 2-rep, empty-only mid-run check read "Pareto-neutral" because the lever doesn't push to empty — it demotes tier (strong→plausible, plausible→wrong_target). Only the 3-rep full-tier cohort_shift_3x3 caught it. Lesson: judge band-response levers on full-tier cohort_shift, never empty-count alone.

v1.11.1 design target (the real deliverable): the step-6 collapse signature (conv<LOW AND trend≤0) cannot distinguish a true stall from a mid-deep-dive transient dip, so the commitment nudge derails recovering trajectories. The trend≤0 Pareto guard was necessary but not sufficient. Next lever needs a non-recovery-specific stall signal — sustained trend≤0 over K steps, or a semantic-breadth-saturation signal (the "breadth not temporal counters" direction flagged in v1.10.4) — NOT a single-step snapshot. The reverted bias + observability events are left in place to mine for it.

Earlier state — 2026-05-20 (v1.10.5 SHIPPED — first clean cohort-shift since v1.10.2)

Working tree: clean post-ship. 808 tests pass + 1 skip. v1.10.5 tagged and pushed to origin/main (commits 6f8ba67 + 9222857, tag v1.10.5 annotated with full release notes).

What to do next session

Four of the post-v1.11 roadmap tracks are now resolved by grounding (B′, C, D) or de-prioritized (A) — none of B′/C/D survived contact with the data, all at ~zero bench cost. main is at v1.10.5 behavior + no_write retirement + calibrated observability + the Track D adapter cleanup. No open blockers; nothing precommitted.

Status of the tracks:

  • B′ / v1.11.x loop-layer predicate — CLOSED (exhausted). v1.11 bench + v1.11.1 offline agree the score<LOW band is not separable with loop-observable signals. [[project_v1111_gate_design_stop]].
  • C — above-loop signaling — CLOSED (premise refuted). Locus discovery is already solved (73/75 touch the gold file early, all tiers); tracebacks rare + anti-correlated with success. Failures are a reasoning/content ceiling, not a "where" problem. [[project_trackc_locus_grounding]].
  • D — BFCL substrate hygiene — CLOSED (record-correction done). Full suite runs+grades on the current substrate (smoke-confirmed); dead bfcl_eval fallback removed; docs corrected. Residual = optional re-baseline (measurement debt), handed off below.
  • A — loop-layer modal tuning — de-prioritized. Diminishing returns; v1.11.1 is evidence the loop layer is near its ceiling.

The real frontier the grounding keeps pointing at: the remaining failure mass (wrong_target/wrong_location/empty with the locus already found) is a model reasoning/content ceiling — what to change in the right file — which sits above all of A/B′/C/D. Above-loop prompt levers have washed out against it repeatedly (v1.7–v1.11). Genuinely new directions would be model-capability-level (a re-bench if a stronger champion appears — see CLAUDE.md single-champion policy) or accepting the current ceiling and shifting to a different benchmark/value axis. This warrants a fresh user conversation, not another loop/prompt lever.

BFCL full re-baseline — DONE (2026-05-22). Agent, n=1240, current substrate, ~9h (32,256s), exit 0, 0 errors: 90.24% total, byte-identical to v1.8 every category (simple 90.25 / multiple 88.50 / parallel 87.50 / parallel_multiple 83.00 / irrelevance 100), Δ=+0.00pp across the board. Zero regression across the 6 releases since v1.8. Artifacts: acceptance/bfcl/post_v1105_full_n1235_agent/rep_1/ (gitignored). Reproduce the compare: python -m scripts.compare_bfcl --baseline acceptance/bfcl/post_specdd_v18_lever1/rep_1/summary.json --candidate acceptance/bfcl/post_v1105_full_n1235_agent/rep_1/summary.json. (A raw-mode model-capability baseline was not run; optional.)

Pinned methodology (the reusable lesson from this whole arc): ground a roadmap track's premise against the actual code/artifacts/data before treating it as actionable. B′, C, and D each looked like work and each dissolved under a cheap grounding pass (offline corpus mine / taxonomy join / artifact read + 10-min smoke). For interventions specifically: screen the gate offline for class-separability (scripts/analyze_v1111_gate_design.py is the template) — if target and protected classes aren't separable in the signal the gate reads, no threshold tuning fixes it. Judge band-response levers on full-tier cohort_shift_3x3, never empty-count alone.

v1.10.5 cycle summary (just shipped)

Headline — v1.10.5c CLEARS ALL SHIP GATES, first clean cohort-shift since v1.10.2:

metric v1.10.4 median v1.10.5 median Δ
strong 19 20 +1 (best ever)
plausible 19 19 0
s+p 38 39 +1 (best ever)
empty_patch 15 13 −2 (= v1.10.2 best)
Docker resolves 37 37 0
Apples-to-apples (56 shared) 35 36 +1 (back to v1.10.2 baseline)
Apples-to-apples BEST rep 35 37 (66.1%) best ever

Cohort-shift v1.10.5 vs v1.10.4 (3-rep × 3-rep, the cycle's strictest gate):

  • DETERMINISTIC LOSSES: 0 ← the methodology gate is CLEAR
  • DETERMINISTIC GAINS: 1 (sphinx-10323: empty 3/3 → wrong_location 3/3, byte-identical to v1.10.3 — the v1.10.4 regression is fully resolved)
  • Modal gains: 2 (astropy-14096 + sphinx-10435 cohort improvements)
  • Modal losses: 0

Archetype outcomes — 3-rep deterministic (all 6 archetypes):

  • sphinx-10435: tier improved to 2/3 strong (v1.10.4 had 1/3 strong); Docker F/F/F (within variance class)
  • matplotlib-14623: Docker 3/3 T (v1.10.4 had 2/3 due to no_report)
  • 5414: T/T/T (preserved load-bearing recovery)
  • 1921: T/T/T (improved from v1.10.4's 2/3 — substrate flake resolved)
  • sphinx-10323: wrong_location 3/3 (byte-identical to v1.10.3 7705189cbc/708b — the v1.10.4 regression target FIXED)
  • sympy-12419: T/T/T (preserved at v1.10.4 baseline; the v1.10.5b regression target stable)

The breakthrough — distinct_files topology partition: the v1.10.5c predicate narrow_reader_signal = NOT (bm25_count > 0 AND grep_count == 0 AND distinct_files >= 2) separates two mechanism-distinct failure modes that share the bm25-without-grep signature:

  • sphinx-10323 (distinct_files=2): synthesis-wandering with breadth → SUPPRESS first-event (let trajectory run v1.10.3-style; matches byte-identical patch)
  • sympy-12419 (distinct_files=1): single-file focus + premature-loop-kill → FIRE first-event (perturbs policy out of repeat-call local attractor)

Both are deterministically separable at suppression #1 with observable loop-layer signals. This is the FIRST loop-layer predicate that empirically clears all 6 archetypes simultaneously.

Cycle deliverables (uncommitted):

  • src/luxe/agents/loop.py: _v1105_synthesis_looping_signature(bm25, grep, distinct_files) + predicate integration + 2 new fields in early_bail_* events (grep_count, distinct_files)
  • tests/test_loop_write_pressure.py: 6 new v1.10.5 tests (1 unit + 5 integration)
  • benchmarks/swebench/subsets/: v1105_sphinx_10323_probe.json, v1105c_sympy_12419_probe.json, v1105c_gate2_n5.json
  • scripts/post_v1105_n75_pipeline.sh
  • Memory entries: project_v1105_predicate_probe_failure.md (updated with corrected diagnosis), project_v1105_ship_validation.md (new)
  • MEMORY.md index updated

Mechanism-design lesson (preserved in memory + lessons.md): predicate calibration must verify features at the actual event-emission point. The initial v1.10.5 predicate failure (which led to v1.10.5b smoke regression) traced to a hand-computed feature error, NOT substrate non-determinism. Substrate is fully deterministic at step 4 (verified 8 reps × 5 archetypes).

Ship recommendation: TAG v1.10.5 as a regular ship release (not substrate). First clean cohort-shift pass since v1.10.2.

Earlier state — 2026-05-19 (v1.10.4 cycle complete; ship verdict HOLD pending v1.10.5 design pass on sphinx-10323 archetype)

Working tree: 6 uncommitted v1.10.4-cycle file changes on main past origin/main (loop.py + tests/test_loop_write_pressure.py + 4 new fixtures/scripts). NO TAG, NO PUSH. 805 tests pass (801 baseline + 4 new breadth_probe tests) + 1 module-skip on bfcl_adapter.

Headline — v1.10.4 hybrid D+B band response delivers best-ever aggregate metrics but introduces one new deterministic regression:

metric v1.10.2 3-rep median v1.10.3 3-rep median v1.10.4 3-rep median
strong 18 18 19 (best ever)
plausible 19 19 19
s+p 37 37 38 (best ever)
empty_patch 15 15 15
Docker resolves (median) 39 (single rep) 35 37
Apples-to-apples on 55 shared 35 33 34 (+1 vs v1.10.3)

Cohort-shift v1.10.4 vs v1.10.3 (per-instance 3-rep × 3-rep matrix — the methodology that caught v1.10.3's hidden regression):

  • DETERMINISTIC GAIN: psf__requests-5414 (plausible→strong 3/3, Docker false→true 3/3). The cluster Docker regression that drove v1.10.3 HOLD is fully closed AND promoted.
  • DETERMINISTIC LOSS: sphinx-doc__sphinx-10323 (wrong_location 3/3 → empty 3/3). NEW regression class introduced by v1.10.4 breadth_probe.
  • Modal gains: matplotlib-14623 (empty→mixed wrong_target/empty), matplotlib-20826 (empty→wrong_location), sphinx-10435 (empty 3/3 → 1 strong + 1 plausible + 1 empty — recovered but partial)
  • Modal losses: matplotlib-25775, psf__requests-2317, sympy-11618
  • 0 strong→empty regressions vs v1.10.2 — the class that drove v1.10.3 HOLD is closed.

Archetype-4 preflight gate (post-cycle codification of the methodology)benchmarks/swebench/subsets/v1104_archetype_n4.json:

archetype v1.10.4 outcome verdict
sphinx-10435 tiers=[plausible, strong, empty] partial — recovered from v1.10.3 empty 3/3 but only 1/3 strong (criterion was ≥2/3)
matplotlib-14623 Docker [T, no_report, T] strict improvement vs v1.10.3 empty 3/3
psf__requests-5414 tiers=[strong, strong, strong], Docker [T, T, T] full win + tier promotion
psf__requests-1921 Docker [T, F, T] 2/3 preserved (was 3/3 in v1.10.3 — harness flake on byte-identical patch)

maintain_suite v1.10.4: 10/10 PASS, score 40/50, v1_release_gate=true. No regression on the foundational benchmark.

BFCL v3: SKIPPED (substrate incompatibility — bfcl_eval requires the pre-v1.10.1 tree_sitter_languages package; out of v1.10.4 scope).

The 10435/10323 mechanism duality (architectural finding of the v1.10.4 cycle):

These two archetypes are mechanism-inverses for the score<LOW band response. sphinx-10435 needs the breadth_probe nudge at suppression #1 to keep going (without it, soft_anchor at step 5 fires with wrap-up wording and terminates the trajectory with empty). sphinx-10323 needs blanket silent suppression to read enough files before committing (with breadth_probe at suppression #1, model commits a 50-line patch at step 4 that's then citation-lint-blocked for lack of read grounding).

Any binary band policy (silent vs probe) trades between them — they form a Pareto-frontier pair. v1.10.4's hybrid D+B preserves the breadth_probe fire on suppression #1 (which fixes sphinx-10435 et al.) but at the cost of sphinx-10323.

v1.10.5 direction (architectural — not yet designed):

The latent variable for the score<LOW band response should be a semantic-breadth signal (citation count, file diversity, grep coverage) — NOT a temporal counter (step number or suppression count). sphinx-10323's failure mode is that the model had a sound hypothesis (synthesizer.md shows thoughtful RST-parsing analysis) but insufficient citation grounding to pass the lint. A breadth_probe that fires conditional on tool_calls_total < N OR file_diversity < K would:

  • Suppress on sphinx-10323 (it had 4 reads + grep at step 4 — enough breadth to deserve silent)
  • Fire on sphinx-10435 (it had 4 reads at step 4 but with score=0.0 indicating no convergence — needs the nudge)

This pattern matches the user's feedback from the v1.10.4 plan-mode review: "the next-level metric would be 'semantic breadth of explored hypotheses' rather than temporal counters."

Ship recommendation: HOLD v1.10.4 pending v1.10.5 design pass. Net cohort is positive (5+ gains vs 1 deterministic + 3 modal losses) but the v1.10.3-cycle methodology — flag any new deterministic regression as HOLD-grade — applies symmetrically. Tagging v1.10.4 with sphinx-10323 as a known regression would damage the same historical-narrative-coherence the user flagged on v1.10.3.

Files changed this cycle (uncommitted):

  • src/luxe/agents/loop.py_EARLY_BAIL_MESSAGE_BREADTH_PROBE + _BREADTH_PROBE_ESCALATION_COUNT=3 constants; per-trajectory suppression_count_in_trajectory + breadth_probe_fire_count state; new env LUXE_EARLY_BAIL_BAND_RESPONSE (default breadth_probe_hybrid, opt-in legacy silent); new event kind early_bail_breadth_probe_fired
  • tests/test_loop_write_pressure.py — 4 new regression tests citing each archetype by name + updated existing test to pin LUXE_EARLY_BAIL_BAND_RESPONSE=silent for backward-compat verification
  • benchmarks/swebench/subsets/v1104_archetype_n4.json (new) — composition-style 4-archetype preflight fixture
  • scripts/audit_v1103_suppression.py (new) — full HARMLESS/HARMFUL/ORPHANED/OUTCOME_W classifier with --archetype-detail mode
  • scripts/post_v1104_n75_pipeline.sh (new) — n=75 pipeline parameterized by REP
  • Memory: project_v1104_ship_validation.md (new), project_v1103_hold_finding.md (new), project_psf_requests_5414_band_case.md (new), project_archetype_preflight_methodology.md (new), updated feedback_intervention_stacking_is_non_pareto.md

lessons.md updated with two new sections: 2026-05-18 v1.10.3 HOLD via cohort-shift methodology; 2026-05-19 v1.10.4 hybrid D+B + 10435/10323 duality.

Earlier state — 2026-05-17 (v1.10.3 SHIP HELD — mechanism shift correct but composite worse at n=1; gh-auth + Mode C bug-hunt landed; need 3-rep before tag decision)

Working tree: clean post-bench. 816 tests pass + 1 module-skip on bfcl_adapter. NO TAG, NO PUSH. Five commits sit on main past origin/main (v1.10.2 docs + 4 v1.10.3-cycle commits).

Headline — v1.10.3 single-rep n=75 + Docker harness misses ship floor on aggregate, mechanism works as designed:

Metric v1.10.3 rep_1 v1.10.2 rep_1 v1.10.2 3-rep range Verdict
strong 19 18 [17, 18] ✓ +1
plausible 18 20 [18, 20] ≈ in range
strong + plausible 37 38 [35, 38] ✓ in range
empty_patch 18 13 [13, 15] +3 outside range
wrong_target 16 19 [17, 19] ✓ −3
wrong_location 4 5 [4, 6] ✓ −1
Docker harness 33 / 75 = 44.0% 39 / 75 = 52.0% (single rep) −6 resolves (−8.0pp)
CONFIDENCE_COLLAPSE 6 (all soft_anchor) 5 (3 SA + 2 expl.) mechanism shift visible — 0 exploratory variant as designed
ABSTAIN_AFTER_INTERVENTION 6 4 ✗ +2
intervention_conversion_rate 73.3% 84.2% ✗ −10.9pp

Cross-cycle Docker delta (acceptance/swebench/post_specdd_v1103_n75/rep_1/harness/):

  • Kept: 32. Surrendered: 7. New: 1 (psf__requests-1921).
  • Surrendered breakdown:
    • matplotlib-14623 — design-accepted (W3 founding case, expected silent-failure shape under v1.10.3)
    • matplotlib-20826, matplotlib-25775, sphinx-10449 — 3 known variance-class instances per project_v1102_variance_baseline.md (could move either way on another rep)
    • psf__requests-1724, psf__requests-1766, psf__requests-5414 — 3 NEW Docker regressions not in the v1.10.2 variance catalog. The psf__requests cluster surrendering 3 instances simultaneously is the most concerning signal — needs investigation before re-ship.
  • 1 errored: scikit-learn-12682 (no_report, harness-side; not a model issue).

Mechanism evidence — working as designed:

  • 0 instances classified with msg_variant=exploratory (W3 variant fully removed)
  • All 6 CONFIDENCE_COLLAPSE are soft_anchor variant — the v1.10.3 dispatch is correct
  • early_bail_suppressed_diffuse events emitted with recent_path_diversity field populated (observability preserved per design)
  • gh-auth hardening held — sklearn-11310 + sklearn-11578 (the v1.10.2 gh-auth flake casualties) BOTH completed cleanly this cycle
  • No test regressions, no crashes

Ship decision: HOLD (do NOT tag). Reasoning:

  • The single-rep aggregate misses the v1.10.2 ship floor (empty_patch +3 above range; Docker −6 vs rep_1).
  • 3 new Docker regressions on the psf__requests cluster don't fit the v1.10.2 variance catalog — could be (a) coincidence variance, (b) hidden cost of silent-suppression on a fixture class v1.10.1's exploratory variant was quietly helping.
  • Per feedback_ship_floor_needs_multirep_when_at_strictness.md: single-rep gates within ±1 of cycle baseline are noise; ±3 is above noise but n=1 can't separate signal from variance.
  • 3-rep replication is the next action. Two reps × ~5h each = ~10h additional wall.

v1.10.3 commits (on main, not pushed, not tagged):

  • ff5f5dfdocs: v1.10.3 code-complete (← this file; will be updated)
  • 3c72d92v1.10.3: revert W3 exploratory variant to v1.10 silent-suppression
  • 833d2caprompts: regression guards for reverted Mode C citation-grounding directive
  • 03df904pr: harden gh-auth preflight — API probe, 5-attempt retry, classifier, TTL cache

The gh-auth + Mode C-guard commits are independent of the W3 decision — they ship regardless. If 3-rep confirms v1.10.3 W3 regression, the option is:

  1. Revert 3c72d92 (W3 silent-suppression) and keep gh-auth + Mode C guards on main; v1.10.3 cycle terminates without a tag.
  2. Investigate the psf__requests cluster + iterate prompt-band design before re-running 3-rep.

Earlier state — 2026-05-17 morning (v1.10.3 code-complete, n=4 smoke clean on mechanism evidence — superseded by n=75 above)

Today's commits (atop v1.10.2):

  • 03df904pr: harden gh-auth preflight — API probe (gh api user --jq .login), 5-attempt retry [0, 0.5, 1.5, 5, 15]s with 10s per-attempt timeout, failure-kind classifier (network|auth|rate_limit|binary_missing|unknown), 90s per-suite TTL cache, structured logging via luxe.pr.gh_auth. project_gh_auth_flake.md hardened; awaiting 3 clean cycles to close.
  • 833d2caprompts: regression guards for reverted Mode C citation-grounding directive — Mode C Step 1 shipped + reverted same day after 3-rep nothing-doc-config A/B showed 0 citations on 1/3 reps + "Stuck in loop" abort on 2/3. Two-imperative wording ("call another tool" AND "omit as last resort") gave the model divergent exits. Lesson saved as feedback_citation_grounding_caused_loop_and_avoidance.md.
  • 3c72d92v1.10.3: revert W3 exploratory variant to v1.10 silent-suppression — restored v1.10 silent-suppression in score<LOW band; kept recent_path_diversity helper + emission on the suppression event as observability (not a gate trigger). _EARLY_BAIL_MESSAGE_EXPLORATORY constant deleted; "exploratory" mode key removed. outcomes.py classifications preserved for stale-log back-compat.

v1.10.3 smoke (benchmarks/swebench/subsets/v1102_probe_n4.json, 4 fixtures, wall 17m41s, acceptance/swebench/v1103_smoke/rep_1/):

Fixture Result Mechanism evidence Verdict
sympy-13031 (W2) empty, clean exit step 20 early_bail(soft_anchor, score=0.25) + action_density + write_pressure → habituation_exit ✅ unchanged from v1.10.1
matplotlib-14623 (W3 founding) empty, loop-abort step 14 11× suppressed_diffuse (score=0, div=2) — NO message lands in chat ⚠️ accepted regression — exact v1.10 silent-failure shape that W3 traded for pylint protection. Per design.
pylint-6528 (W3 collateral) empty, clean exit step 12 3× suppressed_diffuse (steps 4-6) → score rose; soft_anchor fired step 7 n=1 within v1.10.2's 2/3-empty variance; needs 3-rep to compare cleanly
sphinx-10323 (W3 collateral 2) patch_len=708, clean 12× suppressed_diffuse + write_pressure + post_write_idle_exit recovered to non-empty

Mechanism verification PASS on all 4: suppression event carries recent_path_diversity as designed; early_bail_fired no longer carries msg_variant=exploratory anywhere; outcomes.py back-compat with stale logs preserved; no test regressions; no rc=2 / no run_id events (gh-auth hardening held).

v1.10.3 ship gates (per feedback_ship_floor_needs_multirep_when_at_strictness.md):

  • n=4 smoke is a SUBSTRATE-STABILITY signal, not a ship-floor signal. Single-rep gates within ±1 of cycle baseline are noise.
  • A defensible n=75 ship-floor would require 3-rep replication on the variance-class instances (pylint-6528 included). Optional next step if user wants ship-grade evidence.
  • The code revert itself is correct: mechanism behaviors fire as designed, tests pass, no crashes. Defensible to tag based on mechanism evidence + previous v1.10.2 baseline as the population-level prior.

Earlier state — 2026-05-16 (v1.10.2 n=75 3-rep variance baseline — empty_patch range [13, 15] mean 14.3; rep_1 ship was best-of-3; pylint-6528 W3 collateral confirmed; v1.10.3 brief unchanged but firmer)

Working tree: clean. 801 tests pass + 1 module-skip on bfcl_adapter (was 781; +20 from a pre-bench pip install -e . re-pin that picked up modules; net code change for the day is the _do_test timeout cap in commit 3c3b79b). No new tag — the variance baseline is a measurement on the v1.10.2 substrate, not a ship.

Today's headline: the v1.10.2 ship-cycle's "empty_patch = 13 — floor finally hit" was best-of-3. rep_2 and rep_3 both hit 15. Substrate is healthy and deterministic-on-the-strong/plausible-tiers; the wrong_target/wrong_location/empty_patch borderline carries ~±2 instances of noise. The v1.10.3 brief was already pointing at a W3 revert; the rep_2 + rep_3 evidence on pylint-6528 (empty in 2 of 3) firms that decision considerably.

3-rep tally (acceptance/swebench/post_specdd_v1102_n75/rep_{1,2,3}/):

metric rep_1 rep_2† rep_3 mean range
strong 18 17 18 17.7 [17, 18]
plausible 20 18 19 19.0 [18, 20]
strong+plausible 38 35 37 36.7 [35, 38]
empty_patch 13 15 15 14.3 [13, 15]
wrong_target 19 17 19 18.3 [17, 19]
wrong_location 5 6 4 5.0 [4, 6]

† rep_2 ran with n=73 — scikit-learn-11310 + scikit-learn-11578 bailed at rc=2/122s during a brief mid-bench internet outage (the documented gh-auth flake from project_gh_auth_flake.md). Both deterministic across rep_1 + rep_3 (plausible/plausible, strong/strong), so the normalized rep_2 estimate is {strong=18, plausible=19, s+p=37, empty=15} — wash with rep_3. See lessons.md 2026-05-16a for the deadlock cascade on the retry attempt.

Variance-class catalog (6 instances; exclude from single-cycle pass/fail signals):

instance rep_1 rep_2 rep_3 class
astropy-14096 empty wrong_loc empty bouncer (known, 4+ reps)
matplotlib-20826 wrong_loc wrong_loc wrong_target borderline locus
matplotlib-25775 plausible empty wrong_target 3-way unstable (new)
pylint-6386 wrong_target wrong_target empty 1-of-3 outlier
pylint-6528 wrong_target empty empty W3 collateral (confirmed)
sympy-13091 wrong_target empty wrong_target 1-of-3 outlier

Real flip rate: 6/73 = 8.2%. 67 of 75 stable across reps. Strong-tier and plausible-tier classifications are essentially deterministic at temp=0; variance lives entirely in the wrong-locus / empty boundary.

Two incidents closed during the cycle:

  1. gh-auth flake recurred during rep_2 — cost 2 sklearn datapoints. Mitigation (assert_gh_auth() 3× retry) is sound; the network was down longer than the retry window. Updated project_gh_auth_flake.md with the 2026-05-16 occurrence (still open as a luxe-level concern, not promoted to closed).
  2. _do_test had timeout=None — sklearn-11310 retry hung 25 min in pytest after workspace state pollution from a killed prior run. Fixed in commit 3c3b79b: PRConfig.test_timeout_s (default 600s), subprocess.TimeoutExpired caught and recorded as rc=124 cleanly. rep_3 ran with the fix and didn't need to fire it (no deadlock recurred). New regression test test_do_test_timeout_records_clean_failure. New memory entry feedback_test_step_needs_wall_cap.md generalizes the pattern: any luxe step or bench harness shelling out to a user-controlled command MUST set a wall cap.

Implications for v1.10.3 / v1.11 ship gates (delta vs prior brief):

  • Adopt median-of-3 for empty_patch gates, or loosen the floor to ≤15 single-shot. The ≤13 single-rep gate is unsupportable given the measured range. Saved as feedback_ship_floor_needs_multirep_when_at_strictness.md. strong + plausible (range [35, 38]) is more variance-robust and should be the primary ship signal.
  • v1.10.3 W3 revert is firmly supported — pylint-6528 evidence is now 2/3 reps empty, not a single observation. The non-Pareto trade-off is real; silent suppression in score<LOW band remains the best move. The v1.10.2 design brief item #1 (trajectory-shape signals for late-commit vs stopped-responding) is still the right next architectural direction, but the immediate v1.10.3 surface stays small: revert.
  • v1.11 lever sizing partially revised — the v1.10.2 ship report's wrote_to_some_gold_partial: 16 instances at 31.2% Docker rate cross-tab is from rep_1's single observation. Worth re-deriving from the 3-rep union (or running v1.11's lever on a single rep with the variance-class instances excluded from the credit/discredit accounting). matplotlib-25775's variance class also means the v1.10.2 "+1 Docker resolve" needs the same caveat — that win may not survive replication.

v1.10.3 design brief (unchanged from the v1.10.2 ship-cycle brief, with the W3 revert firmed up):

  1. Revert v1.10.1 W3 exploratory variant back to v1.10 silent-suppression in score<LOW band. Keep recent_path_diversity helper + logging as observability for future cycles, but stop using the signal as a gate trigger.
  2. v1.11 locus-disambiguation lever — pre-commit "did you miss any files?" prompt scoped to wrote_to_some_gold_partial bucket. Re-derive the bucket size from a 3-rep union before sizing the expected Docker delta.
  3. Trajectory-shape signals (post-bail tool_call rate, grep vs read ratio in rescue window) remain queued for after v1.10.3 ships; they're the long-term answer to the non-Pareto problem but unbudgeted for this cycle.

Reproduce the variance report:

python -m scripts.variance_v1102_3rep \
    --rep acceptance/v1102_taxonomy/v1102_n75_full_stack_swebench.json \
    --rep acceptance/v1102_taxonomy/v1102_n75_rep_2_full_stack_swebench.json \
    --rep acceptance/v1102_taxonomy/v1102_n75_rep_3_full_stack_swebench.json

(All taxonomy artifacts are gitignored per project convention; reproducible from predictions.json in the corresponding rep dirs via the established compare_v110.classify_arm pipeline.)

Commits today on main:

  • 3c3b79bpr: cap _do_test wall time to defend against subprocess deadlock (the _do_test timeout fix)
  • 882eaf0scripts: v1.10.2 3-rep variance analyzer + gh-auth rerun subset
  • (this current commit) — docs: lessons + RESUME state update

Earlier state — 2026-05-15 (v1.10.2 SHIPPED — empty_patch floor HIT at 13; Docker-WIN +1 resolve; conversion_rate 84.2%)

Working tree: clean post-tag. 781 tests pass + 1 module-skip on bfcl_adapter. v1.10.2 tagged + pushed to origin (annotated, signed). Released atop v1.10.1.

v1.10.2 ship character — first cycle to hit empty_patch floor + cleanest cycle on multiple axes:

  • empty_patch = 13 — the floor (≤13) target first set at v1.7 is HIT for the first time. v1.10 was 14, v1.10.1 was 16; v1.10.2 = 13.
  • Docker harness: 39/75 = 52.0% (+1 resolve vs v1.10.1's 38, +1.3pp). Second consecutive Docker-WIN cycle.
  • intervention_conversion_rate = 84.2% — best ever. v1.10 was 80.9% (then-record); v1.10.1 dipped to 77.6%; v1.10.2 recovers to 84.2% (+6.6pp vs v1.10.1).
  • CONFIDENCE_COLLAPSE: 5 (3 SOFT_ANCHOR + 2 EXPLORATORY). v1.10.1 had 8. Reduction of 37.5%.

Phase D n=75 result (5h26m wall, acceptance/swebench/post_specdd_v1102_n75/rep_1/):

Metric Target v1.10.2 v1.10.1 baseline Δ
empty_patch ≤13 13 ✓ floor HIT 16 −3
strong ≥18 18 ✓ 18 0
plausible 20 20 0
wrong_target 19 17 +2
wrong_location 5 4 +1
strong + plausible ≥35 38 38 0
intervention_conversion_rate ≥75% 84.2% 77.6% +6.6pp
CONFIDENCE_COLLAPSE (total) =0 5 8 −3
.. SOFT_ANCHOR variant 3 4 −1
.. EXPLORATORY variant 2 4 −2
ABSTAIN_AFTER_INTERVENTION ≤5 4 7 −3
Docker harness (overall) ≥38 39 / 75 = 52.0% 38 / 75 = 50.7% +1 resolve (+1.3pp)
Docker harness (patched) 39 / 62 = 62.9% 38 / 59 = 64.4% −1.5pp on larger denom

Cross-cycle Docker delta:

  • Kept resolves: 37 (Docker-resolved both cycles)
  • Surrendered: 1 — sphinx-doc__sphinx-10673 (same-tier Docker demotion; patch_len GREW 2990→3397 this cycle and lost alt-solution credit — opposite shrinkage pattern from v1.10→v1.10.1)
  • New resolves: 2 — matplotlib-25775 (v1.10.1 empty → v1.10.2 plausible + Docker-resolved), sphinx-doc__sphinx-10449 (v1.10.1 empty → v1.10.2 wrong_target + Docker-resolved via alt-solution)

v1.11 substrate signal (the write-locus cross-tab — Item 3's deliverable):

bucket n Docker resolved rate
wrote_to_all_gold 43 32 74.4%
wrote_to_some_gold_partial 16 5 31.2% ← v1.11 lever target
wrote_to_non_gold_only 3 2 66.7% (small sample)
never_wrote 13 0 0.0%

The wrote_to_some_gold_partial bucket of 16 instances at 31.2% Docker rate is the load-bearing v1.11 lever target. A pre-commit "did you miss any gold files?" prompt that converts even half of them to wrote_to_all_gold (74.4% rate) would yield: 8 instances × (0.744 − 0.312) ≈ +3 Docker resolves, pushing v1.11 toward 42/75 = 56% overall.

Item 2 (CONFIDENCE_COLLAPSE split) restored causal attribution:

  • v1.10: 4 SOFT_ANCHOR + 0 EXPLORATORY = 4 total
  • v1.10.1: 4 SOFT_ANCHOR + 4 EXPLORATORY = 8 total (the +4 was net new W3-induced)
  • v1.10.2: 3 SOFT_ANCHOR + 2 EXPLORATORY = 5 total (both classes shrunk)

The class split confirmed v1.10.1's headline "8 confidence_collapse" was carryover + net-new exploratory damage. v1.10.2 reduced BOTH classes — the diversity gate's minimal-trajectory fallback rarely fired in this cycle's variance band, but the metric refinement gives clean cycle-over-cycle attribution.

Item 1 (conditional exploratory) shipped as REDUCED scope after probe-driven revert:

  • recent_path_diversity helper + threshold=2 minimal-trajectory fallback shipped (rarely fires; observability win)
  • Step-based AND immediate post-exploratory escalation IMPLEMENTED, TESTED, and REVERTED before ship when the n=4 probe revealed single-mechanism escalation is non-Pareto: pylint-6528 NEEDED escalation pressure to commit; matplotlib-14623 was on a successful late-commit trajectory that escalation cascaded into habituation_exit (0 writes). Same intervention sequence, opposite outcomes. v1.10.3 needs trajectory-shape signals (post-bail tool_call rate, grep vs read ratio in rescue window), not a single step-based predicate. See lessons.md 2026-05-15 entry.

Substrate plumbing shipped (durable across future cycles):

  • scripts/compare_v110.py: compute_locus_metrics (write-locus + reconnaissance combined); annotate_patch_len_deltas (Item 4 from v1.10.1)
  • scripts/analyze_v110_harness.py: 4-bucket write-locus × Docker cross-tab; separate informational reconnaissance section
  • scripts/backfill_v110_taxonomy.py (NEW): regenerates v1.10 + v1.10.1 + v1.10.2 taxonomies with the CONFIDENCE_COLLAPSE variant split
  • scripts/post_v1102_n75_pipeline.sh (NEW): orchestration shell mirroring v1.10.1's pipeline
  • src/luxe/agents/convergence.py: recent_path_diversity topology signal (separate from convergence-score confidence scalar)
  • src/luxe/agents/outcomes.py: FailureClass.CONFIDENCE_COLLAPSE_SOFT_ANCHOR / _EXPLORATORY + msg_variant capture
  • benchmarks/swebench/subsets/v1102_probe_n4.json (NEW): 4-instance regression probe set

v1.10.3 design brief (small surface; targets the non-Pareto escalation problem):

  1. Trajectory-shape signals for late-commit vs stopped-responding discriminator: post-bail tool_call rate (matplotlib: kept reading; pylint: stopped); grep vs read ratio in 4-step rescue window; first_correct_file_touch_step relative to bail. The discriminator must be available at fire-time of any conditional intervention.
  2. v1.11 locus-disambiguation lever: pre-commit prompt for the 16 partial-coverage instances asking the model to verify all gold-target files have been considered. Sized against the now-trustworthy write-locus cross-tab signal.
  3. Re-examine astropy-14096 variance class: bounced wrong_location → plausible → empty across v1.10/v1.10.1/v1.10.2. 3-rep diligence to confirm whether the substrate is stable enough for ship-gate strictness or whether this is the v1.4-era "borderline doc/manage" variance pattern resurfacing.

Ship-or-hold decision (shipped): All six ship-gate criteria pass. v1.10.2 is the cleanest cycle since v1.10 on multiple axes (empty_patch hit, conversion-rate new high, CC reduction). Tag created.

Earlier state — 2026-05-15 (v1.10.1 SHIPPED — Docker-WIN +2 resolves; inspector composite acknowledged miss; v1.10.2 design brief queued)

Working tree: clean post-tag. 763 tests pass + 19 module-skip on bfcl_adapter. v1.10.1 tagged + pushed to origin (annotated, signed). Released atop v1.10.0 as a Docker-grader release — the practical model-utility metric (Docker resolves) moved +2 vs v1.10 (48.0% → 50.7%) while the strict inspector-tier composite missed CONFIDENCE_COLLAPSE = 0 and empty_patch ≤ 13. User shipped on the Docker-WIN reading rather than holding for v1.10.2 wording iteration; the W3 collateral cases (2 confirmed) are addressed in v1.10.2.

n=75 Phase D result (5h53m wall, acceptance/swebench/post_specdd_v1101_n75/rep_1/):

Metric Target v1.10.1 v1.10 baseline Δ
empty_patch ≤13 16 ✗ (miss by 3) 14 +2
strong ≥18 18 19 −1
strong + plausible ≥35 38 38 0
intervention_conversion_rate ≥75% 77.6% 80.9% −3.3pp
CONFIDENCE_COLLAPSE =0 8 ✗ (+4) 4 +4
ABSTAIN_AFTER_INTERVENTION ≤5 7 ✗ (+2) 4 +3
Docker harness (overall) ≥36 38 / 75 = 50.7% 36 / 75 = 48.0% +2 resolves (+2.7pp)
Docker harness (patched) 38 / 59 = 64.4% 36 / 61 = 59.0% +5.4pp on smaller denom

Cross-cycle Docker delta:

  • Kept resolves: 34 (Docker-resolved in both cycles)
  • Surrendered: 2 — astropy-14096 (v1.10 wrong_location/Docker-resolved → v1.10.1 still patched but Docker-failed), psf__requests-1921 (strong-tier silent demotion; patch shrank 495 → 489 chars, lost alt-solution credit)
  • New resolves: 4 — matplotlib-14623 (the W3 founding test, v1.10 empty → v1.10.1 strong + Docker-resolved), matplotlib-20826, psf__requests-5414, sphinx-doc__sphinx-10673 (silent demotion of v1.10 RECOVERED in v1.10.1!)

Per-tier Docker rates (v1.10.1): strong 15/18 = 83.3%, plausible 13/20 = 65.0%, wrong_target 8/17 = 47.1%, wrong_location 2/4 = 50.0%. wrong-locus Docker rate climbed substantially (v1.10: 35.3%/40.0% → v1.10.1: 47.1%/50.0%) — wrong-locus patches are converting on Docker at a higher rate now, which is what netted the +2 despite the patched-count drop (61 → 59).

Inspector-tier composite missed because of two related dynamics:

  1. 3 v1.10 → v1.10.1 regressions vs 1 recovery (net +2 empties):

    • Recovered: matplotlib-14623 (v1.10 empty → v1.10.1 strong + Docker-resolved) — the W3 founding test, full success
    • Regressed: pylint-dev__pylint-6528 (v1.10 wrong_target → v1.10.1 empty) — confirmed W3 collateral: exploratory variant fired at score=0.0, model interpreted the permissive "you may begin" framing as license to keep exploring instead of committing the wrong-locus candidate it had
    • Regressed: sphinx-doc__sphinx-10323 (v1.10 wrong_location → v1.10.1 empty) — confirmed W3 collateral: same exploratory variant + score=0.0 pattern
    • Regressed: pylint-dev__pylint-6386 (v1.10 wrong_target → v1.10.1 empty) — NOT W3 collateral: msg_variant=soft_anchor at score=0.25 (mid-band, same wiring as v1.10). Likely bench variance on a wrong_target instance (per feedback_replicate_borderline_fixtures.md, wrong_target has measurable temp=0 variance from substrate-state effects).
  2. CONFIDENCE_COLLAPSE 4 → 8 is partly a visibility artifact: the class is defined as "empty + writes=0 + EARLY_BAIL fired." Under v1.10, score < LOW suppressed EARLY_BAIL silently, so collapsed-but-suppressed trajectories did NOT appear in the count. Under v1.10.1, the same trajectories fire EARLY_BAIL with the exploratory variant; if they then go empty, they correctly classify as CONFIDENCE_COLLAPSE. So part of the +4 delta is better measurement of a class that was already there, not a strict regression. The taxonomy class definition needs an audit refinement (e.g., split into confidence_collapse_under_soft_anchor vs confidence_collapse_under_exploratory).

The W3 architectural trade-off (matches the audit reviewer's preemptive warning verbatim):

  • W3 succeeds on matplotlib-14623 archetype (no commit → commit) — the original target class.
  • W3 introduces a new failure mode for trajectories that were producing wrong-locus patches under v1.10's silent suppression: the permissive "you may begin attempting a small corrective edit when you have a candidate" reads as "keep exploring until you have a candidate" on wrong-locus paths, dissolving the implicit commit pressure that v1.10's silence preserved by default.
  • Net Docker outcome: the recovery (+1 Docker resolve on matplotlib-14623, +1 on sphinx-10673 alt-solution) outweighs the regressions (the 3 inspector-tier regressions were ALL Docker-failed in v1.10 anyway — no Docker resolves lost from those).

The W2 lever (habituation exit) is a clean win: sympy-13031 fired the predicate at step=20 with zero post-intervention writes, terminating cleanly instead of burning max_steps. No collateral observed; predicate is conservative enough (3 distinct kinds AND step ≥20 AND no post-intervention write) that no v1.10-passing trajectory was caught by it.

Ship decision (2026-05-15): shipped as Docker-WIN. The +2 Docker resolves represent practical model utility improvement; the inspector composite floor was already missed on v1.10 (CONFIDENCE_COLLAPSE 4 was non-zero); holding the cycle for further iteration would have delayed the v1.10.2 work that targets the residual W3 collateral. The architectural ship pattern matches v1.9 + v1.10 — incremental Docker-grader gains across substrate cycles.

v1.10.2 design brief (small surface; targets the W3 collateral specifically — next cycle starts here):

  1. Make exploratory variant conditional on file-touch novelty: fire exploratory only when the trajectory has touched ≥ N distinct file paths in the last K steps (i.e., truly diffuse, not focused-but-low-score). For pylint-6528-class trajectories that had a candidate file but low score, don't fire exploratory — fall back to soft_anchor.
  2. Audit the CONFIDENCE_COLLAPSE class definition: separate "soft_anchor collapse" from "exploratory collapse" so the metric distinguishes message-induced failure modes. Update outcomes.py enum + classifier.
  3. Diligence the W5 gold-file extraction: the never_touched_gold + touched_before_intervention_but_after_write buckets were empty in the v1.10.1 patched cohort, which doesn't match expectations. Investigate parse_gold_target_files() in scripts/compare_v110.py — likely a path-prefix or unicode issue. Must fix before v1.11 lever design depends on the cross-tab signal.

File trail (v1.10.1 cycle):

  • acceptance/swebench/post_specdd_v1101_n75/rep_1/ — full bench artifacts incl. predictions, harness summary, manifest, taxonomy
  • acceptance/v1101_taxonomy/v1101_n75_full_stack_swebench.json — v1.10.1 taxonomy with patch_len_delta, prior_patch_len, and W5 locus fields (gold_target_files, first_correct_file_touch_step, correct_touch_before_first_write, correct_touch_relative_to_intervention)
  • scripts/validate_v1101_probe.py, scripts/analyze_v1101_smoke.py, scripts/post_v1101_n75_pipeline.sh — re-runnable pipeline scripts
  • benchmarks/swebench/subsets/v1101_probe_n2.json — minimal regression-test subset (sympy-13031 + matplotlib-14623)

Notable: substrate hygiene fixes from this cycle (already shipped on main, separate from v1.10.1 lever changes):

  • src/luxe/agents/loop.py log_calls default-on (footgun closed)
  • benchmarks/swebench/run.py preflight __editable__*.pth grep (substrate isolation enforced)
  • pyproject.toml swap to tree_sitter_language_pack (Python 3.14 wheel gap closed)

Earlier state — 2026-05-14 evening (v1.10.1 substrate complete; smoke validation underway)

Working tree: clean. 763 tests pass + 1 module-skip on bfcl_adapter (= 19 tests gated on bfcl_eval which is permanently incompatible with the v1.10.1 tree_sitter_language_pack pin; documented in tests/test_bfcl_adapter.py importorskip). No new tag yet — v1.10.1 ships when the full smoke + n=75 + Docker gates clear.

v1.10.1 substrate — code complete (commits 6d1709e, d3bf3d9 on origin/main). Six workstreams shipped:

# Workstream Status
W1 tree_sitter_languagestree_sitter_language_pack==0.13.0 + tree-sitter 0.25.x swap ✅ 15 fail → 0 fail; pyproject.toml re-pinned
W2 Habituation clean-exit predicate (≥3 distinct interventions + no post-intervention write + step ≥ 20 → clean break) ✅ predicate + FailureClass.HABITUATION_EXIT + 3 unit tests
W3 Exploratory-support variant for convergence_score < LOW band ✅ replaces v1.10 silent suppression; three-band dispatcher + 3 unit tests
W4 patch_len_delta + same_tier_docker_demotion detection ✅ sphinx-10673 surfaces with Δ=−1686 on real data
W5 first_correct_file_touch metric (v1.11 substrate) ✅ 4 new taxonomy fields + locus × Docker cross-tab in analyzer
W6 Cycle ritual updates + bench-launch __editable__*.pth preflight grep ✅ Docker harness mandatory pre-ship-doc; preflight fails fast on swebench-workspace leaks
+ log_calls default-on (silent footgun caught by probe) ✅ intervention events now logged unless LUXE_SUPPRESS_TOOL_LOG=1

2-instance probe validated both v1.10.1 levers end-to-end (acceptance/swebench/v1101_probe_n2/rep_1/, ~10m wall):

  • sympy-13031 (W2 regression test) — All 3 commitment interventions fired (ACTION_DENSITY_GATE, EARLY_BAIL, WRITE_PRESSURE). habituation_exit event emitted at step=20 (exact predicate boundary). Zero post-intervention writes. Trajectory exited cleanly → ~10-15 min wall saved per habituated instance at scale. Outcome: empty_patch (the predicate doesn't rescue lost trajectories, it exits them cheaply).
  • matplotlib-14623 (W3 regression test) — early_bail fired with msg_variant='exploratory', convergence_score=0.0 (well below LOW threshold 0.10). Produced 24-line patch on lib/matplotlib/ticker.py (LogLocator swapped-vmin/vmax fix) — was empty under v1.10. The previously-silent failure class is now measurably moved.

n=14 smoke PASSED all three ship-gate criteria (acceptance/swebench/v1101_smoke_n14/rep_1/, 66m wall):

  • 0 new regressions vs v1.10 (composition identical: 12/14 patched in both cycles; the 2 empties — sympy-13031 + seaborn-3069 — were empty in v1.10 too).
  • habituation_exit fires ≥ 1: sympy-13031 exited cleanly at step=20.
  • exploratory variant fires ≥ 1: 7 instances fired exploratory at score=0.0 (the diffuse-recon archetype, matplotlib-14623 shape). Distribution across 14: 7 exploratory + 5 soft_anchor + 1 commit_imperative + 1 no fire. Same-outcome under new wiring means the lever didn't BREAK any trajectories that were already converging; the previously-silent band now has a measurable, low-pressure message that doesn't regress passing cases.

Active background task: n=75 against benchmarks/swebench/subsets/v1_baseline_n75.json (exact 75-instance match with v1.10's cohort, apples-to-apples), output to acceptance/swebench/post_specdd_v1101_n75/rep_1/. Expected wall ~5-6h based on smoke pace (~5 min/instance). On completion: save run_id_manifest → Docker harness (~35m) → analysis → ship-gate evaluation.

Ship-gate progress:

Gate Target Status
W1 unit tests 765/765 collected, 0 failing ✅ done
2-instance probe W2 + W3 events fire correctly under real model ✅ done
n=14 smoke Zero new regressions vs v1.10; habituation + exploratory both fire ✅ done
n=75 (~5-6h) empty_patch ≤ 13, conversion_rate ≥ 75%, 0 new regressions 🟡 running
Docker harness (~35m) net resolves ≥ v1.10's 36 ⏳ pending n=75
W4 + W5 real-data check silent_demotion + locus cross-tab in output ⏳ pending n=75

Earlier state — 2026-05-14 (v1.10.0 SHIPPED — mechanism-isolation cycle; floor narrowly missed, conversion +17.9pp)

Working tree: clean post-tag. 765 tests collected; 750 pass on MyEnv (Python 3.14) — 15 fail uniformly on import tree_sitter_languages (package unmaintained, no Python 3.14 wheels; successor tree_sitter_language_pack is installable but requires a one-line swap in src/luxe/symbols.py:159 — queued as v1.10.1 substrate work, no logic regression). v1.10.0 tagged locally (annotated, signed; push status set below). Released atop v1.9.0 with the v1.10 cycle data preserved at acceptance/swebench/post_specdd_v110_n75/rep_1/ (with run_id_manifest.json) and acceptance/v110_taxonomy/.

2026-05-14 audit correction: The test-count line previously read "765 tests passing" unqualified. Manual review on 2026-05-14 caught that four __editable__.*.pth files from swebench-workspace fixture clones (pytest-5840, sympy-12481, xarray-2905, requests-2931) had leaked into ~/.venvs/MyEnv/site-packages and were shadowing real pytest (and providing fake sympy/xarray/requests). All earlier "tests passing" claims in this venv were running against the leaked pytest from the fixture-clone source tree, not a real install. Cleaned up (4 .pth + 3 finder modules + 4 dist-info dirs removed); preflight invariant added (see feedback_swebench_pip_editable_pollution.md); real pytest 9.0.3 reinstalled.

v1.10 ship character — second substrate release in a row, but with substantive empty_patch movement. The literal empty_patch ≤13 floor missed by 1 (14 empties); the intervention_conversion_rate mechanism-level signal jumped from 63.0% to 80.9% (+17.9pp). Best empty_patch count of any luxe cycle (ties v1.5 v1's 14). Two specific regressions diagnosed and have clean v1.10.1 paths.

Phase D n=75 result (3h42m wall, run 2026-05-13 21:54 → 2026-05-14 01:36):

Metric Target v1.10 n=75 v1.9 full-stack Δ
empty_patch ≤13 14 ✗ (miss by 1) 19 −5
strong ≥18 19 20 −1
strong + plausible ≥35 38 38 0
intervention_conversion_rate ≥50% 80.9% 63.0% +17.9pp
CONFIDENCE_COLLAPSE =0 4 ✗ 0* +4
ABSTAIN_AFTER_INTERVENTION ≤5 4 ✓ 0* +4
Docker harness (patched) 36 / 61 (59.0%) 34 / 56 (60.7%) −1.7pp on larger denom
Docker harness (overall) 36 / 75 (48.0%) 34 / 75 (45.3%) +2.7pp

* The v1.9 full-stack baseline shows 0 for CONFIDENCE_COLLAPSE and ABSTAIN_AFTER_INTERVENTION only because of a workspace-overwrite bug: ARM 2 (gate-only, LUXE_EARLY_BAIL OFF) overwrote ~/.luxe/swebench-workspace/<instance>/log/stdout.log before the v1.9 taxonomy backfill ran, so the saved v1.9 taxonomy reflects ARM 2's events on ARM 1's predictions. The TRUE v1.9 full-stack CONFIDENCE_COLLAPSE count is unknown but almost certainly > 0. v1.10's run_id_manifest.json (saved immediately after the n=75 run via scripts/save_run_id_manifest.py) closes this bug; v1.10's 4 is the first honest measurement.

Docker harness result (run 2026-05-14, 34m41s wall, acceptance/swebench/post_specdd_v110_n75/rep_1/harness/harness_summary.json):

  • Net delta: +2 resolves vs v1.9 (36 vs 34). v1.10 ships as Docker-WIN by a narrow margin. Patched-rate dropped 1.7pp because v1.10 produces 5 more patches (61 vs 56) — the larger denominator absorbs the gain; the overall rate (which is the apples-to-apples comparison, both arms have n=75) moves +2.7pp.
  • 4 new resolves: astropy-14096 (v1.9-empty recovery → Docker ✓), django-10973, psf__requests-1724, pydata__xarray-3095 (v1.9-empty recovery → Docker ✓).
  • 2 surrendered resolves: matplotlib-14623 (v1.10 regression to empty, the named diagnosis) and sphinx-doc__sphinx-10673 (silent regression — inspector tier stayed wrong_target both cycles, but the v1.10 patch shrank 3345 → 1659 chars and lost Docker's alternative-solution credit; not caught by inspector grader; v1.10.1 mining candidate).

Per-tier Docker resolution (intersected with has_patch=True only — empty_patch is structurally un-runnable and omitted to avoid diluting the denominator):

Tier n_with_patch n_resolved rate
strong 19 17 89.5%
plausible 19 11 57.9%
wrong_target 17 6 35.3%
wrong_location 5 2 40.0%
new_file_in_diff 1 0 0.0%

Thesis checks (predicted ahead of the run, confirmed after):

  • A. Regression-loss thesis (matplotlib-14623 ∈ v1.10 empties → no Docker entry, surrender confirmed): TRUE. It was Docker-resolved on v1.9 and is absent from the v1.10 harness output.
  • B. Recovery-gain thesis (≥3–4 of the 7 v1.9-empty → v1.10 non-empty recoveries should resolve to net positive): 2 of 7 resolved on Docker (astropy-14096 and xarray-3095). Below the predicted band but enough — combined with the unrelated new resolves on django-10973 and requests-1724 — to deliver +2 net.

Reading: v1.10 is a Docker win, but a thinner one than the inspector-tier picture suggests. The +17.9pp mechanism-conversion gain converts mostly to more patches rather than more resolved patches — the strong tier resolves at 89.5%, but wrong_target/wrong_location at 35–40% means producing more wrong-locus patches barely budges the harness number. The v1.10.1 brief's mechanism-habituation gate and exploratory-support variant are still the right next levers; an additional finding is that the wrong_target → empty regression class (sphinx-10673) needs a separate audit because the inspector taxonomy doesn't surface patch-shrinkage on same-tier instances.

Regression instances (2 single-instance regressions vs v1.9 full-stack):

  • sympy__sympy-13031 strong → empty: ALL THREE interventions fired (soft_anchor early_bail at step 4, post_bail_rescue density gate at step 9, write_pressure at step 15). 30 tool calls, 0 writes. Intervention habituation — same v1.9-substrate pattern, not a v1.10 lever bug. Persisted v1.10.1 work item.
  • matplotlib__matplotlib-14623 wrong → empty: convergence_score stayed at 0.0 for 12 consecutive steps, suppressing early_bail every step. Pure diffuse-recon trajectory (no rereads, no greps, no preview-before-write) → no commitment nudge. The reviewer's preemptive concern came true: we shipped the suppression without the exploratory-support variant. Clean v1.10.1 lever (add diffuse-recon fallback message).
    • Docker-grader impact: matplotlib-14623 was Docker-resolved on v1.9 (alternative-solution credit despite inspector wrong_target tier — see acceptance/swebench/post_specdd_v19_n75/rep_1/harness/harness_summary.json). The v1.10 regression to empty_patch surrenders this Docker-resolved instance. Net Docker-grader movement is pending W3 — recoveries (7) must outweigh this surrendered resolve to call v1.10 a Docker win.

v1.10 mechanism wins (the proof the cycle worked):

  • intervention_conversion_rate 80.9% (47 fired, 38 converted) vs v1.9 full-stack 63.0% (27 fired, 17 converted) — +17.9pp. The convergence-score gating roughly doubled the intervention precision (more fires AND a higher conversion ratio).
  • 7 v1.9 full-stack empties recovered in v1.10 (gross): astropy-14096 (→ wrong_location), matplotlib-20676 (→ wrong_location), matplotlib-20826 (→ wrong_target), psf__requests-5414 (→ plausible), pydata__xarray-3095 (→ wrong_target), pylint-dev__pylint-4604 (→ new_file_in_diff), sphinx-doc__sphinx-10323 (→ wrong_location). 2 new regressions into empty_patch: sympy-13031 (was strong) and matplotlib-14623 (was wrong_target). Net empty_patch delta: −5 (19 → 14). Cross-arm note: vs the v1.9 gate-only arm (separate baseline; not the ship arm), 5 of v1.9-gate-only's 17 empties recovered in v1.10 — including pydata__xarray-2905 (gate-only-empty → v1.10 strong) and matplotlib-13989 (gate-only-empty → v1.10 strong); those instances were already non-empty under v1.9 full-stack so they do not count toward the gross-7 above.
  • sphinx-doc__sphinx-10435 ✓ 17 chars consistent across all v1.9/v1.10 runs that fired early_bail.

2026-05-14 audit correction: This bullet previously read "5 v1.9 empties recovered" and cited pydata__xarray-2905 as an example. The "5" was the net delta (7 gross recoveries − 2 new regressions), not the recovery count itself. The cited xarray-2905 example was a v1.9 gate-only recovery, not a full-stack recovery (under v1.9 full-stack it was already strong). Corrected: gross 7, regressions 2, net −5, arms labeled.

Track 1 of v1.10 (conditional intervention stacking) is the validated architectural pattern. Per-step convergence_score in [0.0, 1.0] composed from four sub-signals (repeated_same_path_access, edit_preview_behavior, localized_grep_density, file_entropy_last_K_events). Suppresses early_bail when score < LOW_THRESHOLD (0.10) and swaps soft_anchor → commit_imperative when score ≥ HIGH_THRESHOLD (0.40). Action-density gate suppressed at score ≥ HIGH. All thresholds documented in src/luxe/agents/convergence.py + the in-file block comment in loop.py.

Track 2 of v1.10 (soft_anchor wording iteration) shipped silently — dropped "rather than continuing broad exploration" comparative from _EARLY_BAIL_MESSAGE_SOFT_ANCHOR. v1.9 ARM 1 evidence showed Qwen3.6-35B-A3B interpreted it as "wrap up now"; positive imperative ending preserves commitment lever without the implicit stop signal. Validated by the +17.9pp conversion-rate jump.

Track 3 of v1.10 (commit_imperative variant) for HIGH convergence: when soft_anchor mode is active AND convergence_score ≥ HIGH_THRESHOLD, swap to _EARLY_BAIL_MESSAGE_COMMIT_IMPERATIVE — tighter wording for trajectories that have already converged on a target via repeated reads / localized greps.

Track 4 of v1.10 (mechanism-level primary metric)scripts/compare_v110.py emits composite (CONFIDENCE_COLLAPSE = 0 AND ABSTAIN_AFTER_INTERVENTION ≤ N AND intervention_conversion_rate ≥ X%). Denominator stability enforced: conversion rate computed among intervention-fired trajectories only. empty_patch demoted to derived secondary. First honest measurement of the mechanism-level distribution under v1.10 conditions.

Substrate plumbing also shipped (durable across future cycles):

  • scripts/save_run_id_manifest.py — preserves instance→run_id mapping immediately after a bench so subsequent runs can't poison the taxonomy. Closes the v1.9 backfill bug.
  • scripts/compare_v110.py accepts --baseline-taxonomy for safe comparison after workspace overwrite.
  • tool_call events now log path arg for retroactive convergence-score mining.
  • Habituation telemetry on action_density_sample events: time_to_first_write_after_intervention, write_burst_persistence, plus the existing since_intervention_step/kind.
  • --no-convergence-gate CLI flag for v1.10 ablation parity (reverts to v1.9 binary same_file_read_twice suppression).

File trail (v1.10 cycle):

  • src/luxe/agents/convergence.py (NEW) — pure convergence-score primitive; 28 unit tests
  • src/luxe/agents/loop.py — wires score into early_bail + action_density_gate; commit_imperative variant; bounded tool_history; post-intervention write telemetry
  • benchmarks/swebench/adapter.py — wires LUXE_CONVERGENCE_GATE=1 by default; convergence_gate=False kwarg for ablation
  • benchmarks/swebench/run.py--no-convergence-gate CLI flag
  • scripts/{compare_v110,save_run_id_manifest}.py (NEW)
  • tests/{test_convergence,test_loop_write_pressure,test_swebench_adapter}.py — +37 tests
  • acceptance/v110_taxonomy/v110_n75_full_stack_swebench.json — first honest mechanism-level measurement

v1.10.1 design brief (incremental — small surface area for fast iteration):

  1. Add an exploratory-support variant for score = 0.0 / score < LOW. Replace "suppress and do nothing" with "fire a low-pressure message that primes commitment without forcing it." Candidate wording: "Mid-loop notice: you have started exploring. As you continue, consider which file is most likely to need modification — you may begin attempting a small corrective edit when you have a candidate." Smoke on matplotlib-14623 specifically (the v1.10 archetypal regression) before any n=75 commit.
  2. Lower LOW_THRESHOLD or refine the score function. matplotlib-14623's score stayed at 0.0 because none of the four convergence sub-signals fired (no rereads, no greps in same dir as reads, no preview-before-write, max entropy). Either the threshold should be ≥ 0 (not > 0) so even "no information" cases get the exploratory variant, OR add a fifth sub-signal that captures any directional intent (e.g., grep-hit-rate, dir-localization-over-time).
  3. Intervention-habituation gate. sympy-13031 fired all three interventions and still produced 0 writes. The substrate has the telemetry (since_intervention_step, next_action_was_tool_call, time_to_first_write_after_intervention) — add a clean-exit predicate: after N interventions with no behavioral shift, exit cleanly rather than burning max_steps.

Earlier state — 2026-05-13 (v1.9.0 SHIPPED — substrate release; floor missed, mechanism win)

Working tree: clean post-tag. 728 tests passing. v1.9.0 tagged locally (annotated, signed; not yet pushed to origin pending user OK). Released atop v1.8.0 with the v1.9 cycle data preserved at acceptance/swebench/post_specdd_v19_n75{,_gate_only}/rep_1/ and acceptance/v19_taxonomy/.

v1.9 ship character: this is a substrate release, not a metric win. The literal empty_patch ≤13 floor was missed in both arms of the A/B; the v1.9 thesis claim (eliminate the CONFIDENCE_COLLAPSE class) was empirically validated. The durable substrate plumbing (adapter env wiring, ablation flags, taxonomy classes, density-gate predicate, mining script) is the value-add — v1.10 will turn it into a metric win via mechanism-isolation work.

Phase D n=75 A/B (run 2026-05-13, ~7h45m total wall):

Metric Target Full-stack (default) Gate-only ablation v1.8 baseline
empty_patch ≤13 19 17 17
strong ≥18 20 ✓ (best-ever) 16 ✗ 18
strong + plausible ≥35 38 39 35
CONFIDENCE_COLLAPSE class =0 0 0 2
wrong→empty regressions =0 2 ✗ 3 ✗ n/a

Mechanism win: both arms eliminated the v18 CONFIDENCE_COLLAPSE class. sphinx-10435 + sympy-13031 (the two named v18 strong→empty regressions) produced patches under full-stack. matplotlib-20676 (the v17 plausible→empty regression) produced 56 chars under gate-only. The v1.9 thesis — give the planner permission to commit under uncertainty without an abstain valve — is empirically real at n=75.

Floor miss diagnosis (architectural, not wording-alone): pure intervention stacking is non-Pareto. Full-stack PROTECTS strongs (0 strong→empty) but BREAKS some plausibles (matplotlib-25775, requests-5414). Gate-only PROTECTS plausibles (0 plausible→empty) but BREAKS slow-strongs (matplotlib-13989, xarray-2905 — both v18 strong cases needing step-4 early_bail to commit). The soft-anchor wording "rather than continuing broad exploration" empirically reads as "wrap up now" for some trajectories — sphinx-10435 rep_2 smoke terminated at step 6 with 832 tokens, no writes, after early_bail at step 4. Both findings inform the v1.10 plan.

Why full-stack ships as the default (not gate-only):

  • Strong count 20 is the best of any luxe cycle; substrate is gentler with high-confidence trajectories than under any prior config.
  • 0 strong→empty regressions vs v18.
  • --no-early-bail / --no-action-density-gate CLI ablation flags remain for v1.10 A/B work.
  • The floor miss is a wording/composition problem, not a code-path problem; reverting to gate-only would lose the strong-count gain without moving the floor.

File trail (v1.9 cycle):

  • src/luxe/agents/loop.py_EARLY_BAIL_MESSAGE_SOFT_ANCHOR variant + _ACTION_DENSITY_GATE_* constants + staged-escalation predicate (standalone + post_bail_rescue modes; convergence-proxy skip) + habituation telemetry on action_density_sample
  • src/luxe/agents/outcomes.pyIntervention.ACTION_DENSITY_GATE + FailureClass.CONFIDENCE_COLLAPSE (decoupled definition: empty + writes=0 + EARLY_BAIL fired)
  • benchmarks/swebench/adapter.py — wires LUXE_EARLY_BAIL + LUXE_ACTION_DENSITY_GATE + LUXE_EARLY_BAIL_MODE=soft_anchor by default; early_bail / action_density_gate kwargs for ablation
  • benchmarks/swebench/run.py--no-early-bail / --no-action-density-gate CLI flags
  • scripts/mine_action_density.py (NEW) — distribution miner with convergence telemetry (unique_files_touched, reread_ratio, same_file_read_twice)
  • scripts/compare_v19_ab.py (NEW) — full-stack vs gate-only ship-floor comparator
  • acceptance/v19_mining/{action_density_distribution.json,action_density_report.md,THRESHOLD_DECISION.md} — locked-in thresholds: step≥6, tok≥1500, tools≤10, bail+2
  • acceptance/v19_taxonomy/{full_stack,gate_only}_swebench_n75.json — backfill for v17/v18 comparison
  • benchmarks/swebench/subsets/v19_smoke_n14.json — phase-C smoke (kept as v1.10 message-iteration smoke set)
  • tests/test_loop_write_pressure.py (+8 tests), tests/test_outcomes.py (+3), tests/test_swebench_adapter.py (+3) — 728 total

v1.10 design brief — "mechanism-isolation cycle" (full version below in §v1.10 backlog):

  1. Conditional intervention stacking — convergence as a smooth SCORE (not binary), combining repeated_same_path_access, edit_preview_behavior, localized_grep_density, file_entropy_last_K. Intervention intensity scales with the score.
  2. Soft-anchor wording iteration — drop "rather than continuing broad exploration"; positive imperative ("Commit to the most promising file and attempt the smallest viable corrective edit"). Smoke on v19_smoke_n14 before any n=75 commit.
  3. Density-gate threshold re-derivation under v19 traces — split into pre_intervention_density_gate (baseline) and post_intervention_density_gate (rescue-path) with separately calibrated decay windows. New telemetry: time_to_first_write_after_intervention, write_burst_persistence.
  4. Mechanism-level primary metric — (CONFIDENCE_COLLAPSE=0 AND ABSTAIN_AFTER_INTERVENTION≤N AND intervention_conversion_rate≥X%), with empty_patch demoted to derived secondary. Conversion rate denominator is intervention-fired-trajectories-only for stability across trigger-policy changes.

Earlier state — 2026-05-13 (v1.8.0 SHIPPED — pre-dispatch gate + taxonomy primitives)

Working tree: clean. 712 tests passing. v1.8.0 tagged + pushed (e21b6b2, signed). Released atop v1.6.1 with the v1.7 cycle data preserved as the architectural-investigation baseline.

v1.8 cycle summary — one architectural win, one trade-off, three substrate primitives.

Phase Result Ship floor
C.8 BFCL n=1240 (Track 2 + 4) irrelevance 240/240 = 100%, total 90.24% (+1.85pp vs v1.7) ALL ✓ (+8pp over irrelevance)
B.5 SWE-bench n=75 (Track 1 + 3 + early_bail) strong 18, empty 17 empty_patch ≤13 missed at 17

Track 2 (pre-dispatch spec gate) is the v1.8 architectural win. When spec has any expects_zero_calls Requirement, the runtime intercepts tool dispatch BEFORE dispatch_tool runs — drops the call, does NOT add to actual_tool_calls, injects a decline reprompt, continues the loop. Capability gating, not policy auditing. Collapsed 23 FORBIDDEN_TOOL_EMISSION cases to zero with no regressions elsewhere. The substrate-legitimacy property is now reliably enforced at the dispatch boundary.

Track 5 (taxonomy) is the v1.8 observability primitive. src/luxe/agents/outcomes.py classifies every episode as (outcome, interventions_fired, failure_chain). Backfilled v17 + v18 in acceptance/v{17,18}_taxonomy/ — future cycles compare by mechanism-level distribution shifts, not aggregate score deltas.

Track 3 (no-abstain message overlay) is a wash on SWE-bench. LUXE_EARLY_BAIL_MODE=no_abstain env (or early_bail_message= kwarg on run_agent) selects an abstain-free variant. SWE-bench adapter sets the env; maintain_suite keeps default. Traded v17's 3 wrong→empty regressions for 2 new strong→empty bails (sphinx-10435, sympy-13031). Confidence collapse — v1.9 message lever.

Track 1 (prose-burst detector) is plumbing + observability. LUXE_PROSE_BURST=1 composite invariant fires once if step ≤4 with no tool calls + completion_delta ≥1500. Did NOT fire on any of the v17 empty class (empirical short-trace bailers have 2-4 tool calls, not zero). action_density logged unconditionally per step — substrate for v1.9 adaptive-threshold tuning.

Track 4 (irrelevance prompt tightening) is masked by Track 2. Effect not isolable in this cycle; A/B is v1.9 work.

Diligence finding (counterintuitive but important): 3-rep on BFCL multiple at temp=0 with oMLX restart between reps landed at 177/200 EXACTLY in all 3 reps. The substrate is fully deterministic — the supposed v1.7 "−4.49pp regression on multiple" turned out to be a phantom (I had cited "v1.6 ~92.99% baseline" which was fabricated; real v1.6 was also 88.50%). No prefix-cache contamination, no hidden interaction. Future cycles must verify baseline citations against summary.json rather than prior-session memory.

Open architectural debt (deferred to v1.9+):

  1. SWE-bench Phase B short-trace bailer class — unreachable by step≥4 rule; needs action_density gating (currently only logged). Track 1's LUXE_PROSE_BURST ships the plumbing; gating awaits distribution data.
  2. Confidence-collapse failure mode under no-abstain message — exposed by Track 3. v1.9 message lever: a "soft-anchor" variant that gives selection heuristic without abstain escape.
  3. Hard/soft constraint primitives. v1.8 ships only the hard flavor (expects_zero_calls). Soft discouragement + ranked priors are v2.x.
  4. Cross-model substrate evaluation via Track 5 taxonomy — first cross-model run is v1.9 territory.

File trail:

  • src/luxe/agents/outcomes.py (NEW) — Track 5 taxonomy
  • src/luxe/agents/loop.py — pre-dispatch gate, prose-burst, message overlay
  • benchmarks/swebench/adapter.py — sets LUXE_EARLY_BAIL_MODE=no_abstain
  • benchmarks/bfcl/adapter.py — tightened irrelevance system prompt
  • scripts/{diligence_multiple_3rep,backfill_v17_taxonomy,backfill_v18_taxonomy,inspect_v17_smoke,audit_v3_empties}.py
  • acceptance/{v17,v18}_taxonomy/, acceptance/{swebench,bfcl}/post_specdd_v18_*/

Earlier state — 2026-05-12 (v1.7 cycle complete, ship HELD pending redesign)

Working tree: clean. 687 tests passing. v1.6.1 last tag (pushed to origin 2026-05-11). 4 commits past v1.6.1 on main + pushed (early-bail substrate + Lever 1 wiring + BFCL adapter), but no v1.7 tag.

v1.7 bench cycle complete 2026-05-12 — both interventions delivered substantive wins on the spirit of the plan; both missed the literal ship floors. User held the v1.7 tag pending redesign rather than ship partial or iterate v1.7.1 on message wording alone.

Phase Run Headline Ship floor
B.4 SWE-bench n=18 smoke acceptance/swebench/v17_early_bail_smoke_n18/rep_1/ 6/18 converted (3 strong, 1 plausible, 2 wrong_target); 15/18 intervention fire rate conversion <10 vs ≥10 floor
B.5 SWE-bench n=75 full acceptance/swebench/post_specdd_v17_early_bail_n75/rep_1/ strong 16→19 (+3); empty_patch 18→16 (-2); 3.77h wall empty_patch 16 vs ≤8 floor
C.7 BFCL irrelevance smoke acceptance/bfcl/v17_smoke_irrelevance/rep_1/ 217/240 = 90.42% (+4.59pp vs v1.6 agent) marginal vs +5pp gate
C.8 BFCL n=1240 full acceptance/bfcl/post_specdd_v17_lever1/rep_1/ total 88.39% (+4.68pp); parallel_multiple 64.5→83.0% (+18.5pp); irrelevance 90.42% irrelevance 90.42% vs ≥92% floor

The biggest v1.7 win: parallel_multiple +18.5pp via Lever 1's min_tool_calls predicate — this is the single largest cycle movement. The min_tool_calls loop-break reprompt is empirically the most reusable Lever 1 wire shape: structural cardinality cues from GT length, mid-loop nudge, no leakage of values.

Why the floors were missed (architectural, not message wording):

  • SWE-bench short-trace bailer class (3 of 18 v3 empties) clean-exit at step ≤3 with 8000+ completion tokens. LUXE_EARLY_BAIL's MIN_STEP=4 rule cannot reach them. Fix requires a per-step prose-burst detector (currently completion_tokens is cumulative-only).
  • SWE-bench early_bail abstain branch caused 3 cases that produced SOMETHING under v3 (wrong_target/wrong_location) to regress to empty_patch under v17 — model took the "explicitly state the existing code is correct" escape valve.
  • BFCL expects_zero_calls fires too late — predicate evaluates AFTER the violating call is added to actual_tool_calls, which the grader has already counted as failed. Fix requires pre-dispatch validation (refuse to call the tool entirely, not just reprompt afterward).

v1.7-redesign queue (see lessons.md 2026-05-12 entry for full design):

  1. Per-step token-delta plumbing in loop.py (currently only cumulative). Powers a prose-burst detector for the short-trace bailer class.
  2. Pre-dispatch spec gate in loop.py — when spec has any expects_zero_calls requirement, intercept tool dispatch and refuse rather than dispatch-then-reprompt.
  3. SWE-bench-specific message overlay so abstain branch can be stripped from _EARLY_BAIL_MESSAGE for SWE-bench without affecting maintain_suite (which legitimately may want abstain).
  4. Tighten irrelevance system prompt with "do not call them under any circumstance" language.

Pending diligence: simple_python (-1.79pp) and multiple (-4.49pp) showed minor BFCL regressions in C.8 vs the v1.6 agent baseline. These categories don't get a Lever 1 spec (single-call GT), so the regression is unrelated to Lever 1. Could be temperature variance or substrate-tier drift. Worth a separate pass before the redesign.

Earlier state — 2026-05-11 (v1.6.1 SHIPPED — substrate hardening + maintain_suite Lever 2 extension + BFCL agent anchor)

Working tree: clean. 652 tests passing (bfcl_eval adapter tests now green after dep landed). v1.6.1 tagged locally at 0a964bf (annotated, signed) on top of v1.6.0 (10 commits since: 7 substrate/maintain_suite + 3 doc rolls). Tag not pushed to origin; the local main branch is 1 commit ahead of origin/main from before the tag.

M5 Max MoE bake-off complete (acceptance/m5max_moe/, 2026-05-10). The full run started at 17/30 (81/150, GLM 0/10) and landed at 30/30 (120/150, all 3 variants pass v1 gate) modulo a single transient embedded null byte ValueError at the commit step (lpe-rope-calc-implement-strict-flag on GLM, scored 4/5 on the recheck). The final official bench shows 29/30; the variance recheck confirms the true rollup is 30/30. See lessons.md 2026-05-10 m5max_moe entry for the full postmortem.

Six fix vectors landed durably:

  1. tools/base.py dispatch_tool strips whitespace in the tool name. GLM-4.5-Air-4bit emits "read_file\n" / "bash\n\n" etc.; without the strip, every dispatch missed and the model bailed (0/10 baseline → 7/10 from this fix alone).
  2. agents/loop.py normalizes tc.name at the loop boundary too. The dispatcher fix wasn't enough — _WRITE_TOOLS, _DEDUP_EXEMPT_TOOLS, schema validation, and dedup keying all read the raw name. With whitespace, writes_seen never incremented for GLM, so WRITE_PRESSURE fired after diffs landed and _POST_WRITE_IDLE_MAX never armed.
  3. agents/loop.py _WRITE_PRESSURE_MAX_TOOLS_BEFORE_FIRE = 15 OR-branch on the existing completion-tokens gate. The 4000-token threshold was calibrated on qwen3.6-35B's prose-heavy failure; qwen3-coder-next averages 1855 completion tokens per fixture — the gate was unreachable. 10 of 11 firings in the verifying re-bench hit the tool-ceiling branch.
  4. agents/loop.py _POST_WRITE_IDLE_MAX = 3 — once any write succeeds, 3 consecutive 0-byte non-write calls trigger a clean exit (not aborted). Catches the post-success verification drift the dup-detector eventually catches but marks as bailout. Fired in 13/30 runs.
  5. benchmarks/maintain_suite/run.py sets LUXE_WRITE_PRESSURE=1 via env.setdefault so the read-loop interrupt is the bench default (ablations can still override).
  6. SpecDD Lever 2 extended to maintain_suiteFixture.forbids_create: list[str] + _inject_forbids_create_sdd writes <repo>.sdd at the cloned-repo root + appends to .git/info/exclude so the synthetic contract doesn't pollute fixture diffs. Three opted-in fixtures (lpe-rope-calc-implement-strict-flag, the-game-implement-shuffle-shortcut, neon-rain-implement-reset-shortcut) get cross-product coverage of test-name shapes (prefix/suffix × separator × root/subpath). Verified end-to-end: .sdd lands, exclude registers, fixture diffs stay clean.

Per-machine env state (not version-controlled, documented inline in RESUME.md §oMLX configuration and §maintain_suite bench-host prereqs):

  • ~/.omlx/settings.json sampling.max_context_window: 32k → 48k (qwen3-coder-next was hitting 33k+ per turn on nothing-ever-happens-document-config).
  • brew install node (npm 11.12) — fixture neon-rain-implement-reset-shortcut shells out to npm test.

The variance class is open for v1.7. GLM at temp=0 still shows ~10% per-fixture variance across replicates (orphan scaffold creation, transient embedded null byte from the commit step). Existing scoring gates (vacuous_test, orphan_file) catch these; Forbids creating cuts the rate further via the recovery-gradient error wording. Lever 3 positive constraints ("you must edit X") are the long-term answer per the v1.7 backlog below — not gating any v1 bench.

BFCL v3 anchors filed (2026-05-11) — both runs completed clean on top of the v1.6 substrate:

  • Raw mode (regression check, ~6.1h): 948/1240 = 76.45% (+0.16pp vs pre-SpecDD 76.29%) — no infra drift across v1.4.1 → v1.6.1.
  • Agent mode (one-shot v1.6 datapoint, 8.47h): 1038/1240 = 83.71% (+7.26pp vs raw). Parallel cliff +17pp (parallel) and +16.5pp (parallel_multiple) is the dominant lift; irrelevance regressed −6.25pp (loop primes tool-eagerness). Wall ETA originally estimated at 18–24h; the substrate's per-call efficiency lands it at ~25s/problem instead.

BFCL agent adapter does NOT wire .sdd injection or the Lever 1 spec validator (benchmarks/bfcl/adapter.py:run_problem_agent) — the +7.26pp is loop-vs-single-shot, not SpecDD-driven. That wiring is now v1.7 priority #2 below. Side lesson: the parallel_multiple probe (n=50, 86%) was 21.5pp optimistic vs the full n=200 (64.5%) — BFCL subset files are ordered, not shuffled; future probes must sample randomly or be framed strictly as infrastructure validation.

BFCL raw-vs-agent comparison ambiguity (v1.7+) — once Lever 1 is wired into run_problem_agent (priority #2), agent-mode runs include GT-structure hints: call cardinality for parallel/parallel_multiple problems (min_tool_calls predicate) and the zero-call expectation for irrelevance (expects_zero_calls predicate). Raw mode does NOT include these hints. Post-v1.7 raw-vs-agent deltas measure [loop scaffolding + Lever 1 hints] vs [no loop], not loop alone. Re-baseline raw mode after each substrate change if substrate-only deltas are needed. The fairness call (use structure, not values) is per RESUME.md v1.7 priority #2 design; documented inline in benchmarks/bfcl/adapter.py:_spec_from_problem.

See memory entries project_bfcl_post_specdd_v16_raw.md + project_bfcl_post_specdd_v16_agent.md; lessons.md 2026-05-11 entry has the full postmortem.

v1.6.1 SHIPPED 2026-05-11 (tag 0a964bf, local only — not pushed to origin). Patch on top of v1.6.0 capturing: (a) substrate hardening from the m5max_moe bake-off, (b) SpecDD Lever 2 extended into maintain_suite, (c) BFCL v3 agent anchor (data only, no code). No architectural shift — v1.7 is reserved for early-bail intervention and BFCL Lever 1 wiring per the priority list below.


⚡ Resume here — v1.7 priorities (unchanged)

The four remaining v1.6-era loose ends below still apply. The m5max_moe substrate work landed durably and clears the path for v1.7 work; the "open question" from the m5max_moe lessons.md entry — do the threshold-asymmetry findings generalise to SWE-bench? — is now the natural first probe before the early-bail intervention design lands.

v1.7 priorities (in order of expected impact)

  1. Early-bail intervention — addresses ≥10 of the 18 v3 paired-mechanism empty_patch cases (the agent_bailed class). Interception strategy: detect the bail signature in the loop (consecutive low-output steps + no write-tool calls) and inject a directive turn rather than letting the loop trip its stuck detector. Prerequisite: LUXE_LOG_TOOL_CALLS=1 traces of the 18 v3 empties to confirm class composition. With m5max_moe's _POST_WRITE_IDLE_MAX and tuned WRITE_PRESSURE thresholds now in place, the bail-class composition may already shift before any v1.7 work lands — worth re-checking traces before designing.
  2. BFCL Lever 1 wiring + abstain gradient — two-part. (a) Extend benchmarks/bfcl/adapter.py:run_problem_agent to derive a per-problem Spec from the expected-calls structure and pass it as a reprompt gate. (b) Address the −6.25pp irrelevance regression with an explicit "no-call is a valid outcome" gradient — either as a Lever 1 predicate (expects_zero_calls: true) or as system_prompt language. Baseline to beat: agent 83.71% total, parallel_multiple 64.5%, irrelevance 85.83%. Lever 1 is doing real work in BFCL iff parallel_multiple climbs further AND irrelevance recovers toward 92%.
  3. b2 multi-site retrieval — extend the spec-validator predicate kinds so SpecDD Lever 1 can demand citations from N sites within a single fixture. Closes the loose-grader gap surfaced in project_loose_grader_audit.md.
  4. In-loop test execution feedback — pipe pytest results from the previous step back into the model's next prompt. Likely gates the second strong-tier rebound (Phase B nearest-anchoring tightening, slated to fire here).
  5. Mode B threshold tuning — broader bench data is incoming from v3 + Phase B; revisit the 10 tools / 4000 tokens / step 5 thresholds against the v3 traces. The m5max_moe tune (tool-ceiling OR-branch) already addressed the most acute miscalibration on tool-call-heavy models; more granular per-model defaults are next.
  6. Lever 3 — held until empty_patch class is fully addressed; Lever 3 needs clean separation of constraint vs reasoning failures, and the empty_patch class confounds that boundary today.

v1.6-era loose ends (status as of 2026-05-11)

  1. BFCL v3 post-SpecDD raw-mode DONE 2026-05-11: 948/1240 = 76.45% (+0.16pp vs pre-SpecDD 76.29% — well inside ±2pp tolerance; no infra drift). Folded into v1.6.1 docs.
  2. BFCL agent-mode post-SpecDD run DONE 2026-05-11: 1038/1240 = 83.71% (+7.26pp vs raw v1.6). Parallel cliff +17pp; irrelevance regressed −6.25pp (loop primes tool-eagerness). Folded into v1.6.1 docs; baseline-to-beat captured in v1.7 priority #2.
  3. (Optional follow-up) Re-aggregate the v3 harness summary into a tracked harness_summary.json once the rebuilt harness.py:collect_results fix is exercised on a fresh run. Current summary was written via the fixed collector against the existing logs/run_evaluation/luxe_v16_n75/ dir.
  4. sphinx-doc__sphinx-10466 strong→unresolved is the lone strong tier instance the harness rejected. Worth a glance for v1.7 prep but not a v1.6 blocker.

Earlier state — 2026-05-10 (v1.6.0 SHIPPED)

Working tree: clean post-tag. 643 tests passing. v1.6.0 tagged with the v3 ship-floor + Docker harness numbers. BFCL v3 post-SpecDD raw-mode comparison run kicked off (~3.5h wall, in-progress as of tag time).

Ship-floor result (Phase D Step 3, all gates green):

Signal Floor v3 actual
new_file_in_diff =0 0 ✅ (jq cross-check confirms zero new file mode)
strong ≥14 16
strong + plausible ≥30 36
empty_patch ≤18 18
wrong_target ≤20 (soft) 17 ✅ (no Phase B anchoring spike)

Docker harness (Phase D Step 4, n=75): 36/75 = 48.0% resolved in 34m43s, 0 errors. Tier × resolved: strong 15/16 (94%), plausible 10/20 (50%), wrong_target 8/17 (47%), wrong_location 3/4 (75%), empty_patch 0/18 (0%). The strong inspector tier is a near-perfect predictor of harness-resolution; 11 wrong_target/wrong_location resolves are alternative-solution credit (model fixed a different file/locus than gold, tests pass anyway).

v3 vs pre-Lever-2 baseline (long-arc claim): strong 12→16 (+33%); empty_patch 26→18 (−10.7pp); new_file_in_diff 4→0 (full class elimination); any non-empty 45→57 (+27%). Paired-mechanism win sustained AND class eliminated.

v3 vs v2 (creation-only delta): new_file 2→0 (the architectural target). xarray-3305 + sphinx-10466 both empty/wrong_loc → strong (variance, not collateral, confirmed). sympy-12481 invent→plausible (gold file modified). matplotlib-24870 new_file→empty (1/2 v2-escape "constraint pressure → occasional abandonment", within budget).

Architectural shift recap — operation-aware policy: v1.5 encoded "these filenames are suspicious" (path-aware). v1.6 encodes "creating verifier scaffolding is disallowed" (operation-aware). .sdd gains a new section Forbids creating that fires only when a write would create a new file at the target path. The policy boundary now matches the behavioral distinction the system was missing: repository participation (legitimate edits to existing files) vs benchmark gaming (invented validation scaffolds).

Section Fires on edit? Fires on create?
Forbids (existing)
Forbids creating (v1.6 new)

creating = not Path.is_file() is computed in _write_file at the moment of the write. _edit_file always passes creating=False (existence enforced two lines later). Disk state naturally handles the multi-step trajectory case (create in step 1, edit in step 2) without synthetic planner state. Distinct error messages: "forbidden ... do not write outside allowed paths" for unconditional Forbids (reads as wrong location) vs "forbidden-on-create ... Edit an existing file instead of creating a new one" for create-only matches (reads as wrong operation; primes reroute, not bailout).

Phase A static audit (full SWE-bench Verified n=500, 2026-05-06): CLEAN — zero gold patches create a test_*.py file. The broad **/test_*.py create-ban ships as a stable adapter-wide policy, not subset-specific tuning.

Phase C smoke (n=14, acceptance/swebench/v16_smoke_n14/rep_1/, 2026-05-06):

  • new_file_in_diff = 0 across all 14 ✅ (HARD floor met)
  • sympy-12481 reroute (the architectural test case): was inventing test_fix_check.py in v2 → v1.6 produced a strong gold-match by editing sympy/combinatorics/permutations.py directly. The qualitative transition invent scaffold → modify existing artifact was empirically demonstrated.
  • Both v2 strong-tier "regressions" (xarray-3305, sphinx-10466) rebounded to strong → confirms variance hypothesis (not glob collateral).
  • v2-strong preservation 4/5 (matplotlib-13989 dropped to empty — within ±1 variance budget).
  • matplotlib-24870 (other v2 escape) went empty rather than rerouting. 1/2 architectural test cases reroute cleanly; the other shows the user-predicted "constraint pressure → occasional abandonment". Mixed but net positive.

SWEBENCH_SDD_BODY split: only repo_root/** (synthetic prompt-context path) stays in Forbids. ALL scaffolding-name patterns moved to Forbids creating, including the v2-escape additions: test_*.py, **/test_*.py, test_fix_*.py, **/test_fix_*.py. Internal .sdd dogfood (src/luxe/luxe.sdd etc.) unchanged — Forbids creating is bench-specific in v1.6.

See ~/.claude/plans/cozy-wiggling-conway.md for the full v1.6 plan, the audit gates, the ship-floor table, and the Phase B nearest-existing-test anchoring watch.

v1.6 ship-cycle Phase D reference commands (kept for re-run)

Step 1 — n=75 v3 rerun with creation-only forbids — DONE 2026-05-09

Reference command (kept for re-run):

brew services restart omlx && sleep 5 && \
cd ~/Downloads/luxe && \
LUXE_LOG_TOOL_CALLS=1 OMLX_API_KEY=omlx-sdb25582k3mq8pf9 nohup \
  .venv/bin/python -m benchmarks.swebench.run \
    --subset benchmarks/swebench/subsets/v1_baseline_n75.json \
    --output acceptance/swebench/post_specdd_v16_creation_only_n75/rep_1/ \
    > /tmp/n75_v16.log 2>&1 &

Adapter binds LUXE_WRITE_PRESSURE=1 and disables commit.gpgsign automatically; no shell env munging needed beyond OMLX_API_KEY. Restart oMLX before any rerun to clear pinned models.

Step 2 — Compare v3 vs prior runs

# v3 vs pre-Lever-2 baseline (the long-arc claim)
.venv/bin/python -m benchmarks.swebench.compare_runs \
    --pre  acceptance/swebench/pre_specdd_v141_n75/rep_1/predictions.json \
    --post acceptance/swebench/post_specdd_v16_creation_only_n75/rep_1/predictions.json \
    --gold-source benchmarks/swebench/subsets/raw/verified.jsonl

# v3 vs v2 (isolates the creation-only semantic shift)
.venv/bin/python -m benchmarks.swebench.compare_runs \
    --pre  acceptance/swebench/post_specdd_v15_pressure_v2_n75/rep_1/predictions.json \
    --post acceptance/swebench/post_specdd_v16_creation_only_n75/rep_1/predictions.json \
    --gold-source benchmarks/swebench/subsets/raw/verified.jsonl

# Inspector — verdict tally + new_file_in_diff escape audit
.venv/bin/python -m benchmarks.swebench.smoke_inspect \
    --predictions acceptance/swebench/post_specdd_v16_creation_only_n75/rep_1/predictions.json \
    --gold-source benchmarks/swebench/subsets/raw/verified.jsonl \
    | grep -E "^  (strong|plausible|empty_patch|new_file_in_diff|wrong_location|wrong_target)" \
    | awk '{print $1}' | sort | uniq -c

Step 3 — Ship-floor check (HARD; all must hold)

The headline is not new_file_in_diff = 0 in isolation — that alone could be achieved by suppressing all writes (which would push empty_patch up). The success signal is the combination: scaffolding creation blocked AND model didn't bail under the additional pressure AND model rerouted to correct edits, not any edits.

Signal Floor v2 actual v3 target
new_file_in_diff =0 2 =0 (HARD)
strong ≥14 16 ≥14
strong + plausible ≥30 35 ≥30
empty_patch ≤18 17 ≤18 (within +1 of v2)
wrong_target ≤ v2 + 4 16 ≤20 (soft watch — Phase B "nearest-existing-test anchoring")

Acceptance gate:

  1. Inspector reports zero new_file_in_diff entries.
  2. jq cross-check on v3 predictions.json: list any model_patch containing new file mode lines — should agree with inspector at zero.
  3. strong ≥14 AND strong+plausible ≥30 AND empty_patch ≤18.
  4. wrong_target composition delta vs v2 — if it spikes by +5 or more, Phase B nearest-anchoring watch fired (model satisfied pressure by editing some existing test rather than the correct one). Inspect 3 random wrong_target rows that came from previously-empty v2 instances; if model_files cluster on tests/..., anchoring is real and tag should hold for v1.7 planning-prompt tuning.
  5. Spot-check 3 random strong rows by reading the patch — guards against "broad glob accidentally blocked legit edits".

Stop conditions:

  • Any of (1)-(3) fails → do NOT tag. Investigate what shape escaped.
  • (4) fires (wrong_target +5 or more) → hold tag. Phase B postmortem before deciding ship vs v1.7-tune.
  • empty_patch climbs above 22 → the new error message + create-only semantics aren't providing the recovery gradient; v1.6 needs a re-read.

Step 4 — Docker harness scoring (~30-45m) — MANDATORY before ship-doc write-up (v1.10.1 ritual update)

v1.10.1 audit ritual fix: Docker harness numbers MUST land BEFORE the ship-doc + tag is written, not as a follow-up. The v1.10 audit caught that writing the ship doc against inspector-tier only missed (a) the matplotlib-14623 Docker-resolved surrender, (b) the sphinx-10673 silent same-tier Docker demotion — both invisible without harness output. If the harness takes 30–45m, that's the same window as polishing the doc; build it into the cycle.

Run the wrapper at benchmarks/swebench/harness.py against the cycle's predictions.json. Confirm Docker Desktop is up + ~10GB free + RAM headroom. Output to the cycle's harness/ subdir. Numbers go into the release commit body and the RESUME ship-character table (both patched % and overall % kept visually separate per the v1.10.1 reporting discipline).

Step 5 — Tag v1.6.0

Tag message records v3 absolute floors AND delta vs v2 (creation-only effect) AND delta vs pre-Lever-2 baseline (long-arc claim):

git tag -a v1.6.0 -m "$(cat <<'EOF'
v1.6.0: SpecDD Lever 2 — creation-only forbids (operation-aware policy)

`.sdd` gains a `Forbids creating` section that fires only when a
write would create a new file. Splits two qualitatively different
operations the v1.5 contract conflated:
  - editing a pre-existing file (legitimate repository participation)
  - inventing a new file (benchmark gaming)

`creating = not Path.is_file()` is operationally observable,
deterministic, and stateful across turns automatically — disk state
handles multi-step trajectories without synthetic planner state.

Distinct error wording for the create-only class
("forbidden-on-create ... Edit an existing file instead of creating
a new one") gives the planner a recovery gradient — wrong operation
rather than wrong location.

Phase A static audit (full SWE-bench Verified n=500): zero gold
patches create a test_*.py file → broad **/test_*.py create-ban
ships as a stable adapter-wide policy.

n=75 v3 (creation-only forbids):
  strong:                <v3>  (v2: 16 → v3: <delta>)
  strong + plausible:    <v3>  (v2: 35 → v3: <delta>)
  empty_patch:           <v3>  (v2: 17 → v3: <delta>; baseline 26)
  new_file_in_diff:      <v3>  (v2: 2 → v3: 0;  baseline 4)
  wrong_target:          <v3>  (v2: 16 → v3: <delta>)
  any non-empty patch:   <v3>
  FAIL_TO_PASS (Docker harness): <pre> → <post>

vs pre-Lever-2 baseline (acceptance/swebench/pre_specdd_v141_n75/rep_1/):
  empty_patch:           -<X>pp  (paired-mechanism win, sustained)
  new_file_in_diff:      0       (full class elimination, durable)
  strong:                +<X>    (gold-match increase, durable)

The architectural shift: v1.5 encoded "these filenames are suspicious"
(path-aware folklore). v1.6 encodes "creating verifier scaffolding is
disallowed" (operation-aware policy). The policy boundary stops
conflating two distinct operations on the same target.
EOF
)"

Earlier state — 2026-05-06 morning (v1.5.0-rc-2; v2 result captured, ceiling discovered)

v2 n=75 rerun result (acceptance/swebench/post_specdd_v15_pressure_v2_n75/rep_1/):

Metric Pre-Lever-2 baseline Post-Lever-2 (no pressure) v1 paired v2 paired Ship floor
strong (gold-match) 12 13 16 16 ≥12
strong + plausible 30 32 32 35 ≥30
empty_patch 26 30 14 17 ≤28
new_file_in_diff 4 0 8 2 =0
any non-empty patch 49 45 61 56

Headline (v2 vs baseline): empty_patch 26 → 17 (−35%); strong 12 → 16 (+33%); any-non-empty 45 → 56 (+24%). The paired-mechanism (.sdd constraint + WRITE_PRESSURE actuation) sustained its win.

Headline (v2 vs v1): new_file_in_diff cratered 8 → 2 (−75%) under broad-glob tightening. 6 of 8 v1 escapes routed to legitimate buckets (1 strong, 3 plausible, 2 wrong_target).

The blocker: 2 escapes remained — test_bool_contour.py (matplotlib-24870) and test_fix_check.py (sympy-12481). Both shapes are indistinguishable from legitimate test files by name alone. No broad glob can safely cover them as edit-or-create bans. The v1.5 broad-glob approach hit an architectural ceiling that more patterns cannot resolve. Hence v1.6.

Falsification check passed (2026-05-06): gold patches for the two strong-tier "regressions" (xarray-3305, sphinx-10466) and the xarray cluster (xarray-6938) do NOT match any v1.5 broad glob. Those regressions are temp=0 variance, not glob collateral. Smoke (Phase C) later confirmed: both rebounded to strong under v1.6.


Earlier state — 2026-05-04 night (pre-SpecDD anchors)

SWE-bench n=75 pre-SpecDD anchor — DONE (acceptance/swebench/pre_specdd_v141_n75/rep_1/):

  • 7h 34m wall (15:47 → 23:21 on 2026-05-04). 49/75 non-empty patches; mechanical 45/75 (60%).
  • Strong (gold-match): 12/75 = 16%. Strong + plausible: 30/75 = 40%. Manual high-confidence (post Step-2 review): 24/75 = 32%.
  • Empty-patch (26/75 = 35%) is the dominant failure mode at n=75 scale; n=10 had zero. Anti-reproducer prompt's locate→read→edit→verify protocol fails to even produce a candidate diff on a third of stratified instances.
  • 4/75 created test_fix.py despite anti-reproducer rule — prompt is leaky; tool-side enforcement is the right shape.

BFCL pre-SpecDD baseline complete (acceptance/bfcl/pre_specdd_v141/rep_1/, 2026-05-04):

  • TOTAL: 946/1240 = 76.29% in ~3.5h wall
  • Parallel cliff: parallel_multiple sits 33pp below single-call avg.

Explicit non-goals this session

  • Lever 3 — held until empty_patch class is fully addressed. Lever 3 needs clean separation of constraint vs reasoning failures; the empty_patch class confounds that boundary until early-bail intervention lands.
  • Phase B trace inspection on matplotlib-24870 — non-blocking diagnostic. Doesn't gate v1.6 tag; informs whether bailout-after-forbid is a real interaction or just hard-instance variance. Slated for v1.7 prep.
  • Tagging v1.6.0 with current data — would lock in unverified ship floor. Wait for v3.

Background tasks (queued, non-blocking)

These do not block v1.6.0 tag; revisit after the overnight v3 lands.

  • Retire v1.3 directive reprompt code in cli.py (~15 min) — superseded by SpecDD Lever 1 spec validator
  • min_added_lines as per-requirement predicate kind in src/luxe/spec.py
  • ast_query and manual predicate full integrations (currently stubbed)
  • Tune Mode B thresholds based on broader bench data (currently 10 tools / 4000 tokens / step 5) — extra signal incoming from v3 + Phase B
  • Bring benchmarks/swebench/run.py ETA format into BFCL standard (group + global counts) — cosmetic
  • Per-fixture .sdd contracts on the maintain_suite (Lever 3 prep) — depends on trace: field audit
  • Minimality-bias A/B (orthogonal experiment proposed pre-Lever-2): adds swebench_bugfix_minimal PromptVariant. Re-evaluate after v1.6 ships — may not be needed if empty_patch is already in target range.

Memory entries (read first)

External benchmark program — current focus:

  • project_v16_creation_only.mdPRIMARY v1.6 creation-only forbids ship state + n=14 smoke result + n=75 v3 plan
  • project_v15_specdd_lever2_shipped.md — v1.5 Lever 2 ship state + paired-mechanism reframe
  • project_swebench_n75_baseline.md — pre-Lever-2 anchor: 32% high-confidence; empty-patch 26/75 dominant
  • project_swebench_smoke_2026_05_04.md — n=10 A/B + a/b1/b2/b3/c/d/e taxonomy (n=10 was 50pp optimistic; superseded by n=75)
  • project_bfcl_pre_specdd_baseline.md — 76.29% combined, parallel cliff diagnosed
  • project_external_benchmark_program.md — overall SWE-bench n=75 + BFCL v3 plan

Bench-substrate / failure-mode work:

  • project_doc_config_three_modes.md — A/B/C decomposition of doc-config variance
  • project_v1_4_1_mode_b_validation.md — 10/10 PASS validation
  • project_v1_4_validation.md — original v1.4.0 3-rep result (9.67/10 effective)
  • project_compound_goal_audit.md — SpecDD premise empirically thin
  • project_loose_grader_audit.md — 5/10 graders looser than goal text (closed at v1.4 spec layer)

Diagnostic / process:

  • feedback_exception_hierarchy_catch_order.md — when except clauses cover an inheritance hierarchy, derived class first
  • feedback_fixture_prep_dirty_tree.md — synthetic-.sdd-class fixture prep needs --allow-dirty in the agent invocation
  • feedback_deliberation_amplifiers.md — don't extrapolate "think more" prompt clauses from single-instance probes; A/B before shipping
  • feedback_benchmark_progress.md — all bench runners need group + global elapsed/remaining/ETA
  • feedback_instrument_loop_first.mdLUXE_LOG_TOOL_CALLS=1 before adding prompt mass
  • feedback_verify_fixture_grader.md — read base file before debugging model behavior
  • feedback_replicate_borderline_fixtures.md — 3× replicate before claiming regression
  • feedback_offline_cache_refs.md — don't read origin/<branch> in offline cache
  • feedback_offer_long_running_commands.md — bench >5 min: hand off, don't auto-run
  • feedback_validate_first.md — cheap probe before multi-hour runs

Closed non-starters:

  • project_mlx_use_ane_probe.md — feature doesn't exist in MLX
  • project_omlx_logprobs_unsupported.md — oMLX silently strips logprobs:true
  • project_qwen3_migration.md — fully reverted

Latent / open:

  • project_regrade_local_origin_bug.md — fixed in v1.4.1
  • project_gh_auth_flake.md — open but mitigated by --retry-errors
  • project_lmstudio_loop.md — open
  • project_omlx_metal_crashes.md — latent

30-second orientation

luxe is an MLX-only repo maintainer for Apple Silicon (oMLX backend on localhost:8000). Takes a goal + repo, opens a PR. Mono-only since v1.0 — single model, single agent loop, single luxe maintain command. Champion: Qwen3.6-35B-A3B-6bit in configs/single_64gb.yaml.

What's shipped through v1.10.0:

  • v1.0 — mono-only; 10 fixtures; strict gates
  • v1.1 — pinned work_dir default + manage_strict overlay → 9/10
  • v1.2 — per-tool subphase pass: cve_lookup gated to manage; bash chain-hardening; read_file binary detection
  • v1.3 — read_file dedup exemption + lpe-typing fixture surgery + reprompt-on-doc + _diff_against_base fix
  • v1.4 — SpecDD Lever 1: programmatic Definition of Done; per-requirement spec validator; reprompt gate uses spec
  • v1.4.1 — citation-linter bare-filename fallback (Mode A) + Mode B mid-loop write-pressure (opt-in) + sidecar regrade lint re-run
  • v1.5.0-rc-2 — SpecDD Lever 2 paired-mechanism (.sdd constraint + WRITE_PRESSURE actuation); 619 tests
  • v1.6.0 (tagged 2026-05-09) — creation-only Forbids: .sdd gains Forbids creating section, creating: bool threaded through write-time guards; recovery-gradient error wording; SWE-bench n=75 v3 36/75 = 48.0% harness-resolved; 643 tests
  • v1.6.1 (tagged 2026-05-11 0a964bf, pushed to origin) — substrate hardening (6 fix vectors from m5max_moe bake-off); SpecDD Lever 2 extended into maintain_suite (Fixture.forbids_create + synth .sdd injection); BFCL v3 anchors (raw 76.45%, agent 83.71%); 652 tests
  • v1.8.0 (tagged 2026-05-13 e21b6b2, pushed to origin) — Track 2 pre-dispatch spec gate (capability gating); Track 5 episode-outcome taxonomy (src/luxe/agents/outcomes.py); Track 3 SWE-bench message overlay (LUXE_EARLY_BAIL_MODE=no_abstain); Track 1 prose-burst detector + action_density observability (LUXE_PROSE_BURST=1); Track 4 irrelevance prompt tightening. BFCL n=1240 = 90.24% (irrelevance 100%, +9.58pp); SWE-bench n=75 wash with v17 (empty floor missed, deferred to v1.9). 712 tests. (v1.7 cycle data preserved; no v1.7 tag.)
  • v1.9.0 (tagged + pushed 2026-05-13 — SUBSTRATE RELEASE) — LUXE_ACTION_DENSITY_GATE staged-escalation predicate (standalone + post_bail_rescue modes; convergence-proxy skip; thresholds from scripts/mine_action_density.py); _EARLY_BAIL_MESSAGE_SOFT_ANCHOR variant (selection heuristic without abstain valve); Intervention.ACTION_DENSITY_GATE + FailureClass.CONFIDENCE_COLLAPSE taxonomy classes (decoupled definition); adapter wires the full intervention stack by default + --no-early-bail / --no-action-density-gate CLI ablation flags; habituation telemetry on action_density_sample. CONFIDENCE_COLLAPSE class eliminated (0 in both A/B arms; v18 had 2); empty_patch floor MISSED (full-stack 19, gate-only 17 vs ≤13 target); strong count best-ever at 20. 728 tests. Note: v1.9 backfill taxonomy was poisoned by workspace-stdout-overwrite bug; v1.10 closed this via scripts/save_run_id_manifest.py. Docker harness post-ship: 34/56 = 60.7% FAIL_TO_PASS resolved (34/75 = 45.3% total).
  • v1.10.0 (tagged 2026-05-14, local only — MECHANISM-ISOLATION SHIP) — src/luxe/agents/convergence.py (NEW) smooth convergence score [0,1] composed from four sub-signals (repeated_same_path_access, edit_preview_behavior, localized_grep_density, file_entropy_last_K_events). LUXE_CONVERGENCE_GATE=1 wires conditional intervention stacking: suppress early_bail when score < LOW (0.10), swap soft_anchor → commit_imperative when score ≥ HIGH (0.40), suppress action_density_gate at high convergence. _EARLY_BAIL_MESSAGE_SOFT_ANCHOR wording iteration (drops "rather than continuing broad exploration"); new _EARLY_BAIL_MESSAGE_COMMIT_IMPERATIVE for high-convergence trajectories. scripts/compare_v110.py (NEW) composite mechanism-level primary metric. scripts/save_run_id_manifest.py (NEW) preserves instance→run_id mapping across workspace overwrites. n=75 result: empty_patch 14 (best-ever, tied with v1.5; floor ≤13 missed by 1); intervention_conversion_rate 80.9% (+17.9pp vs v1.9 full-stack 63.0%) — the v1.10 mechanism-isolation thesis empirically validated. 2 single-instance regressions diagnosed (sympy-13031 = intervention habituation; matplotlib-14623 = score=0.0 suppression without exploratory-support fallback). 765 tests (was 728). v1.10.1 brief: exploratory-support variant for diffuse-recon + intervention-habituation clean-exit predicate.

v1.6.1 SHIPPED 2026-05-11 (tag 0a964bf, local only):

  • m5max_moe substrate hardening (6 fix vectors): tool-name strip in dispatcher + loop boundary; _WRITE_PRESSURE_MAX_TOOLS_BEFORE_FIRE = 15 OR-branch on completion-tokens gate; _POST_WRITE_IDLE_MAX = 3 clean-exit signal; LUXE_WRITE_PRESSURE=1 as maintain_suite default
  • SpecDD Lever 2 extended into maintain_suite: Fixture.forbids_create: list[str] + _inject_forbids_create_sdd writes synthetic <repo>.sdd + .git/info/exclude append; 3 fixtures opted in with cross-product JS test-name coverage
  • BFCL v3 anchors filed: raw 76.45% (regression check, no infra drift) + agent 83.71% (+7.26pp vs raw; parallel cliff +17pp; irrelevance −6.25pp)
  • 652 tests passing
  • BFCL agent run did NOT exercise Lever 1 — adapter wiring is v1.7 priority #2

What's queued for v1.10.0 — "mechanism-isolation cycle":

  1. Conditional intervention stacking — convergence as a smooth score. v1.9 evidence: soft-anchor converts "hesitant but near-solution" trajectories while harming exploratory recovery paths. Convergence signals (same_file_read_twice, grep_then_open_same_path) imply the model has formed a candidate execution locus. Don't gate on a binary primitive — compose a smooth score from repeated_same_path_access (already mined as reread_ratio), edit_preview_behavior (diff/grep/preview before write), localized_grep_density (fraction of grep matches in same file/dir as recent reads), file_entropy_last_K_events (Shannon entropy of touched paths). Intervention intensity scales with the score — low (diffuse-recon → no soft-anchor; consider exploratory-support variant), mid (standard soft-anchor), high (tighter commitment phrasing). Binary primitives are brittle against benchmark-specific trace structure.
  2. Soft-anchor wording iteration. Drop "rather than continuing broad exploration" (frames current behavior as failure; induces premature closure). Adopt positive imperative + narrow concrete next-step framing + zero mention of exploration. Candidate to A/B: "Commit to the most promising file and attempt the smallest viable corrective edit." Validation gate: smoke on benchmarks/swebench/subsets/v19_smoke_n14.json BEFORE any n=75 commit. Message variants are cheap to overfit emotionally and expensive to validate statistically.
  3. Density-gate threshold re-derivation under v19 traces. v1.9 changed trajectory shape enough that v18-inherited thresholds are no longer trustworthy. Post-intervention trajectories are NOT IID relative to pre-intervention — the intervention itself alters action cadence. Split the gate into two calibrated paths: pre_intervention_density_gate (baseline, current standalone mode) and post_intervention_density_gate (rescue, current post_bail_rescue mode) with separately calibrated decay windows and minimum action counts. Re-derive from v19 traces, not v18. New observability-only telemetry: time_to_first_write_after_intervention (wall+step delta) and write_burst_persistence (writes sustained for >N consecutive actions). Both may be more predictive than raw action density.
  4. Mechanism-level primary metric. v1.9 demonstrated empty_patch moves slowly even when named mechanisms are resolved — multiple latent failure modes contribute to one aggregate. v1.10 primary: (CONFIDENCE_COLLAPSE = 0 AND ABSTAIN_AFTER_INTERVENTION ≤ N AND intervention_conversion_rate ≥ X%). Each component is a hypothesized causal pathway; the metric is scientifically actionable. Denominator stability (critical): intervention_conversion_rate MUST be computed among intervention-fired trajectories only, not all trajectories — otherwise future trigger-policy changes (the convergence-score work above) distort apparent gains by changing the denominator. empty_patch demoted to derived secondary.

See ~/.claude/plans/serene-napping-cupcake.md §Phase E.7 for the full v1.10 design brief, including the rationale traceable to specific v1.9 trace evidence (e.g., sphinx-10435 rep_2 step-6 termination).

Iteration model: bench changes go through scripts/regrade_local.py for fast iteration on grader/linter logic without re-running luxe. Full bench re-runs reserved for end-of-phase confirmation.


The bench-as-truth pattern

Every model claim goes through:

  1. Run python -m benchmarks.maintain_suite.run --variants <yaml>.
  2. Read the printed comparison table — pass/fail/wall/tokens/bailouts per cell.
  3. Inspect every PASS PR by hand via the actual local-branch ref in the offline cache: git -C ~/.luxe/fixture-cache/<repo> diff <base_sha>..<branch_name>. Do NOT use origin/<branch> — the cache's stale GitHub-tracking refs point to old runs and silently mislead. Branch name is in ~/.luxe/runs/<run_id>/pr_state.json.
  4. Sidecar regrade with scripts/regrade_local.py --output <dir> for fast, faithful re-grading without re-running luxe (seconds vs 60-120 min). As of v1.4.1, re-runs the citation linter against the original synthesizer.md.

Real PASS count is always ≤ printed count. Every historical bake-off has had at least one false-positive PASS.


Files of consequence

Path Purpose
src/luxe/agents/single.py mono runner — agentic loop end-to-end; _build_sdd_block injects Repository contracts (v1.5)
src/luxe/agents/loop.py shared loop; Mode B write-pressure injection (v1.4.1); tool-call ceiling OR-branch + _POST_WRITE_IDLE_MAX clean exit + tc.name loop-boundary normalization (2026-05-10)
src/luxe/agents/prompts.py prompt registry + TaskOverlay; doc/manage strict variants
src/luxe/citations.py diff-aware citation linter; bare-filename fallback (v1.4.1); spec_violation/spec_orphan (v1.5)
src/luxe/sdd.py .sdd parser — seven canonical sections incl. forbids_create (v1.6), tolerant header normalization (Forbids creatingforbids_create)
src/luxe/spec_resolver.py chain assembly + glob matching — find_all_sdd, resolve_chain, format_sdd_block; is_forbidden(rel, *, creating) kwarg-only required (v1.6); all_forbids_create helper (v1.6)
src/luxe/spec.py SpecDD Lever 1 data model (Requirement, Spec, YAML round-trip)
src/luxe/spec_validator.py SpecDD Lever 1 predicate evaluator + reprompt-text helper
src/luxe/tools/base.py dispatch_tool (tool exceptions captured as retry-able errors); name.strip() at dispatch boundary tolerates whitespace from GLM-style emit shapes (2026-05-10)
src/luxe/tools/fs.py write-time honesty guards; _check_spec_forbids pre-write enforcement; creating: bool threaded (v1.6) — _write_file computes via Path.is_file(); _edit_file always False; create-only error wording for recovery gradient
src/luxe/luxe.sdd root invariants (v1.5 dogfood) — Forbids retired src/swarm/** etc.
src/luxe/agents/agents.sdd (v1.5 dogfood) — prompt registry as single source of truth
src/luxe/tools/tools.sdd (v1.5 dogfood) — honesty guards before Forbids; cve_lookup gating
benchmarks/maintain_suite/maintain_suite.sdd (v1.5 dogfood) — bench rules
CLAUDE.md (v1.5) — auto-loaded by Claude Code; points at the .sdd chain
src/luxe/backend.py chat() accepts repeat_penalty; unload_model(), loaded_models()
src/luxe/cli.py luxe maintain (mono only); --spec-yaml for SpecDD reprompt gate
src/luxe/config.py RoleConfig w/ system/task prompt + overlay ids + repeat_penalty
benchmarks/maintain_suite/run.py bench harness; Variant carries prompt + overlay overrides; _inject_forbids_create_sdd writes <repo>.sdd + appends to .git/info/exclude for per-fixture SpecDD Lever 2 (2026-05-10); LUXE_WRITE_PRESSURE=1 env default
benchmarks/maintain_suite/grade.py grading + strict gates + multi-variant v1_release_gate; Fixture.forbids_create: list[str] field (2026-05-10)
benchmarks/maintain_suite/fixtures.yaml the 10 v1 fixtures (each w/ requirements: block)
benchmarks/swebench/ SWE-bench Verified adapter (preds-only + Docker harness wrapper + compare)
benchmarks/swebench/smoke_inspect.py inspector v2 — mechanical + gold-proximity tier (--gold-source); 5 signals, line-based hunk proximity, hunk coverage
benchmarks/swebench/run.py preds-only runner; idempotent resume; --no-inject-sdd + --no-write-pressure flags (v1.5) for ablation
benchmarks/swebench/adapter.py synthetic .sdd injection (v1.5); paired-mechanism env wiring + commit.gpgsign override (v1.5.0-rc-2); SWEBENCH_SDD_BODY split into Forbids + Forbids creating (v1.6); broad **/test_*.py create-ban added
benchmarks/swebench/compare_runs.py (v1.5) — pre/post predictions delta report (per-instance + class-level + summary)
benchmarks/swebench/subsets/v1_baseline_n75.json 75 stratified instances, 12 repos — the pre-SpecDD anchor target
benchmarks/swebench/subsets/v16_smoke_n14.json (v1.6) — Phase C smoke: 4 v2 regressions + 5 v2-strong preservation + 5 random; deterministic seed 20260506
benchmarks/swebench/subsets/probe_n10.json n=10 A/B subset (4 easy + 6 medium across 10 distinct repos)
benchmarks/swebench/subsets/probe_12907.json single-instance probe used for the original hypothesis-stall trace
benchmarks/bfcl/ BFCL v3 adapter (raw + agent modes, schema converter, grader); resume + ETA in run.py
configs/single_64gb.yaml maintain_suite config — Qwen3.6-35B-A3B-6bit, manage_strict_only overlay
configs/single_64gb_swebench.yaml swebench config — swebench_strict_only overlay (anti-reproducer prompt); the n=75 default
configs/single_64gb_swebench_counterexample.yaml A/B variant with falsification clause; negative control, not promoted
scripts/regrade_local.py sidecar regrade w/ citation re-run (v1.4.1)
scripts/register_omlx_models.py symlink HF cache → ~/.omlx/models/
lessons.md running postmortem; latest entry covers v1.6 creation-only architectural shift
~/.claude/plans/fancy-honking-lerdorf.md external benchmark plan (SWE-bench n=75 + BFCL v3)
~/.claude/plans/fluffy-brewing-lemur.md SpecDD plan (Levers 1/2/3)
~/.claude/plans/humble-prancing-patterson.md v1.5.0 ship plan + failure-mode analysis
~/.claude/plans/cozy-wiggling-conway.md v1.6.0 ship plan (this session) — creation-only forbids architecture + audit gates + Phase D ship floor

oMLX configuration

~/.omlx/settings.json:

"max_model_memory": "36GB",
"idle_timeout": { "idle_timeout_seconds": 1800 },
"sampling": { "max_context_window": 49152 }

max_context_window was bumped from 32768 (default) to 49152 on 2026-05-10 during the m5max_moe bake-off — qwen3-coder-next-80B under realistic retrieval load on nothing-ever-happens-document-config hits 33k+ per turn and oMLX returns a hard 400 below the new ceiling. Qwen3 family natively supports 128k+, so 48k is well within model architecture. This is per-machine state and not version-controlled — any new bench host needs the same bump.

System-level Metal wired ceiling — kept aligned with max_model_memory:

sudo sysctl iogpu.wired_limit_mb=36864
echo "iogpu.wired_limit_mb=36864" | sudo tee -a /etc/sysctl.conf

API key for HTTP requests: export OMLX_API_KEY=omlx-sdb25582k3mq8pf9 (in user's shell init; the bench harness reads it).

Restart oMLX any time settings.json, model_settings.json, or new symlinks land: brew services restart omlx.

maintain_suite bench-host prereqs

The 10-fixture suite includes fixtures that shell out to npm test as their tests_pass predicate (neon-rain-implement-reset-shortcut). Without node + npm on the bench host, those fixtures rc=127 and are misscored as model failures. brew install node is the one-shot fix on macOS. Documented here because the toolchain prereq isn't obvious from the fixture YAML alone.


Trace instrumentation

LUXE_LOG_TOOL_CALLS=1 emits per-tool-call and per-step events to the run's events.jsonl. Permanent debugging knob (off by default, zero overhead when off):

LUXE_LOG_TOOL_CALLS=1 python -m benchmarks.maintain_suite.run --id <fixture> --force
RUN=$(jq -r .luxe_run_id acceptance/<output>/.../state.json)
jq -c 'select(.kind=="tool_call" or .kind=="tool_step_done")' ~/.luxe/runs/$RUN/events.jsonl

Mode B fix events (when LUXE_WRITE_PRESSURE=1):

jq -c 'select(.kind=="write_pressure_fired")' ~/.luxe/runs/$RUN/events.jsonl

Critical gotchas

  • oMLX idle_timeout: null keeps models resident forever. Set to 1800.
  • luxe maintain post-run unload fires by default. Bench mode uses --keep-loaded (already passed by _luxe_maintain in run.py).
  • At temp=0 the variance collapses to deterministic vectors (probe_a == probe_b across all 10 fixtures on 2026-05-01 PM). At temp=0 a 1-fixture delta IS the signal — except on SWE-bench where prompt-cache state and instance ordering can produce ±2-3 strong/empty drift between runs (the "variance budget" referenced in v1.6 ship floor).
  • Offline mode caps every fixture at 4/5gh pr create always fails (no GitHub remote), so pr_opened (1pt of 5) never fires offline. Every PASS reads as 4/5; gate math (≥8 fixtures with score ≥4) still works correctly.
  • origin/<branch> in offline-cache repos is a stale-ref trap — post-2026-05-01 runs push to local branches (refs/heads/...) which do NOT update remote-tracking refs. Use git diff base..<branch> (local ref) or sidecar regrade.
  • Dense >30B mxfp8 doesn't fit on 64GB Mac under load — granite-4.1-30b-mxfp8 spiked 22GB+ wired and pushed system into swap. MoE models (Qwen3.6-35B-A3B at ~3B active) run comfortably; dense models don't.
  • stuck_after_done doesn't always mean failure — Qwen3.6-35B-A3B often ships a real diff then trips the stuck-loop detector on cleanup. Distinguishes from stuck_no_output (never engaged).
  • run.py resume model treats status: error as skip_done by default — if a sweep dies before any model invocation, re-launching without --retry-errors silently skips every fixture and prints a zeroed Summary. Either pass --retry-errors or rm -rf the output dir.
  • is_forbidden is now kwarg-only required (v1.6)chain.is_forbidden(rel, creating=...). Callers that pass positional-only will fail at runtime. Tests use creating=False for edit-time checks; bench paths compute creating = not Path.is_file().

Recent commit trail (most recent first)

Run git log --oneline -20 for fresh state. Highlights from recent sessions:

1d848ae  maintain_suite: broaden JS forbids_create — catch hyphen-prefix variants (2026-05-10)
b00ffe1  maintain_suite: per-fixture Forbids creating + synth .sdd injection (2026-05-10)
f962ee6  agents/loop: normalize tool name at the loop boundary too (2026-05-10)
4590e68  maintain_suite: default LUXE_WRITE_PRESSURE=1 + m5max_moe runbook docs (2026-05-10)
6cf6b2a  agents/loop: WRITE_PRESSURE tool-ceiling branch + post-write idle exit (2026-05-10)
fceff7e  tools/base: tolerate whitespace in tool names from dispatch_tool (2026-05-10)
5cc3c87  maintain_suite: M5 Max bench-env prep + multi-variant repo hygiene (2026-05-10)
2240f22  docs: v1.6.0 SHIPPED — n=75 v3 + Docker harness 36/75 (48.0%)
4e9df21  swebench/harness: per-instance report aggregator for swebench >= 4.x
e49d7da  docs: RESUME.md — Phase D Step 1 done (n=75 v3 ran clean)
3174a79  docs: rewrite README for v1.6.0-rc-1 (mono-only, SpecDD Lever 2)
92ceb4c  docs: v1.6.0-rc-1 state + creation-only architectural shift entry
49c8acb  v1.6.0-rc-1: SpecDD Lever 2 — creation-only forbids (operation-aware policy)
04c8aac  docs: v1.5.0-rc-2 state + paired-mechanism v1 result + Forbids tightening
1d5b006  v1.4.1: citation-linter bare-filename fallback + Mode B write-pressure + regrade lint re-run
707bab8  v1.4.0: SpecDD Lever 1 — programmatic Definition of Done; first 10/10 bench

When in doubt

git log --oneline -20 tells the trajectory. lessons.md has postmortems of every failure pattern. The user prefers terse, action-oriented responses — don't summarize what they can read; tell them the next step.

The user is comfortable with auto mode but draws hard lines on destructive shared-system actions (oMLX config, sudo, force-push, deletes outside their workspace). When in doubt, write the change but ask before applying. Do NOT push to remote unless explicitly asked.