Benchmark + autoresearch harness for the flattening stage by mvdoc · Pull Request #68 · gallantlab/autoflatten

mvdoc · 2026-06-10T03:17:05Z

What this adds

A reproducible, provenance-tracked benchmark + autoresearch harness for optimizing the
pyflatten flattening stage, plus the first optimization result. All code lives under
benchmark/ (the installed autoflatten package is untouched); see benchmark/PLAN.md for
the full approved design and .claude/skills/autoflatten-autoresearch/ for session recovery.

Headline result: replacing the FreeSurfer-style projection + initial negative-area-removal
init with a guaranteed flip-free Tutte embedding reaches the same flattening quality at
~31% less runtime on the first hemisphere (sub-022 lh: 15.25% vs 15.20% mean distortion,
24 vs 25 flipped triangles, 454 s vs 658 s). The Tutte init starts at 0/385122 flipped
triangles (Tutte's theorem) vs the projection's 141690, so the ~4-min initial NAR phase is
unnecessary. Broader (n=4) validation and phase-removal experiments are running.

Components

paths.py / ledger.py — output-location policy + append-only JSONL provenance ledger
(commit SHA, env, seeds, metrics, artifacts, repro command, decision trace).
metrics.py / harness.py — uniform per-patch metrics + method-agnostic
evaluate(flatten_fn, ...) with on-disk k-ring caching.
build_dataset.py — manifest from the public Narratives dataset (OpenNeuro ds002345),
reusing cached projection patches (no FreeSurfer, no fetching).
run_baseline.py — FreeSurfer-clone baseline + determinism check (verified bit-identical).
probe_tutte_init.py — the flip-free (Tutte/LSCM) init probe.
experiment.py — general autoresearch runner (init method + refinement-phase toggles).
plot.py — fast flatmap renderer for visual verification.
report.py — renders the ledger into a human-readable NOTEBOOK.md.
Unit tests (benchmark/tests/), bench extra in pyproject.toml.

Output policy

Code + markdown docs live in the repo; all loop-generated artifacts (ledger, manifest,
flat patches, caches, figures, NOTEBOOK.md) are written to /data2/projects/autoflatten/,
not the repo. Each ledger record links its artifacts back to the producing commit SHA.

Status / verification

Phase A (harness + baseline + ledger) verified end-to-end on CPU; determinism confirmed.
Tutte probe verified flip-free and visually sound (clean single-blob flatmaps, no tangling).
12 unit tests pass; ruff clean.
Compute is CPU-only (the box's GPUs are blocked by an old driver, 440/CUDA 10.2).

🤖 Generated with Claude Code

Provenance-tracked, CPU-only benchmark for optimizing the pyflatten flattening stage, on the public Narratives dataset. Foundation pieces: - paths/ledger/metrics/harness: method-agnostic evaluate(flatten_fn, ...) with uniform per-patch metrics (mean/p90 % distance error, flipped triangles, area distortion) and an append-only JSONL provenance ledger (commit SHA, env, seeds, metrics, artifacts, repro command, decision trace). - build_dataset/run_baseline/report: build the manifest from cached Narratives patches, run the FreeSurfer-clone baseline (+ determinism check), render NOTEBOOK.md. - Project recovery skill (.claude/skills/autoflatten-autoresearch) so any session can recover state from the ledger and resume the loop. - PLAN.md (approved design) + README; bench extra in pyproject; unit tests. Code lives in the repo; all generated artifacts go to /data2/projects/autoflatten. See benchmark/PLAN.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Injects a guaranteed-injective Tutte embedding (libigl harmonic map, boundary pinned to a circle) as the initial map and disables the initial negative-area- removal phase, feeding the existing geodesic-stress refinement. Verified flip-free: Tutte init gives 0/385122 flipped triangles at 33.7% distortion, vs the FreeSurfer projection's 141690 flips at 172%. Logs a probe record with hypothesis + head-to-head conclusion vs the latest baseline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

benchmark/plot.py renders the saved flat patches of a ledger experiment (mesh + flipped triangles in red + boundary), pulling metrics from the ledger instead of recomputing distortion (plot_flatmap's per-vertex distortion recompute is the bottleneck: ~5s vs ~20min on a 193k-vertex hemisphere). --full keeps the 3-panel plot. Visual check on sub-022 lh: baseline and Tutte-init maps are both clean single-blob flatmaps with no tangling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Parametrized runner (init method + phase toggles + k-ring) that logs each variant to the ledger with a decision trace, for fanning out optimization experiments. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a comprehensive, reproducible, and provenance-tracked benchmark and autoresearch harness for the pyflatten stage of AutoFlatten. It includes dataset building from the public Narratives dataset, a method-agnostic evaluation harness, an append-only JSONL ledger for tracking experiments, quality metrics, plotting utilities, and a Tutte/LSCM flip-free initialization probe. The reviewer's feedback is highly constructive, pointing out several robustness improvements when parsing the JSON ledger (such as using .get() and handling explicit null values to avoid KeyError and TypeError exceptions) and highlighting a potential side-effect where mutating flattener.config in-place could unintentionally affect shared configuration objects across evaluations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-10T03:19:54Z

+    base = Ledger().latest(kind="baseline")
+    base_m = base["metrics"] if base else None


To prevent a potential KeyError if the baseline record is missing the "metrics" key, use .get("metrics") instead of direct bracket access.

Suggested change

base = Ledger().latest(kind="baseline")

base_m = base["metrics"] if base else None

base = Ledger().latest(kind="baseline")

base_m = base.get("metrics") if base else None

gemini-code-assist · 2026-06-10T03:19:54Z

+        flattener.config.negative_area_removal.enabled = False
+        flattener.initial_projection = lambda: init  # injected into run()


Mutating flattener.config directly modifies the shared config object passed into evaluate(). If the same config instance is reused across multiple evaluations or baseline runs, this side-effect will unexpectedly disable negative area removal for those runs as well. Consider copying the config or avoiding in-place mutation of shared arguments.

Suggested change

flattener.config.negative_area_removal.enabled = False

flattener.initial_projection = lambda: init # injected into run()

import copy

flattener.config = copy.deepcopy(flattener.config)

flattener.config.negative_area_removal.enabled = False

flattener.initial_projection = lambda: init # injected into run()

gemini-code-assist · 2026-06-10T03:19:54Z

+def _baseline_reference():
+    """Most recent baseline record's aggregate metrics, for head-to-head printing."""
+    rec = Ledger().latest(kind="baseline")
+    return rec["metrics"] if rec else None


To prevent a potential KeyError if the baseline record is missing the "metrics" key, use .get("metrics") instead of direct bracket access.

Suggested change

def _baseline_reference():

"""Most recent baseline record's aggregate metrics, for head-to-head printing."""

rec = Ledger().latest(kind="baseline")

return rec["metrics"] if rec else None

def _baseline_reference():

"""Most recent baseline record's aggregate metrics, for head-to-head printing."""

rec = Ledger().latest(kind="baseline")

return rec.get("metrics") if rec else None

gemini-code-assist · 2026-06-10T03:19:55Z

+    records = Ledger().read()
+    record = next((r for r in records if r["experiment_id"] == experiment_id), None)


Use .get("experiment_id") instead of r["experiment_id"] to avoid raising a KeyError if any record in the ledger is missing the "experiment_id" key.

Suggested change

records = Ledger().read()

record = next((r for r in records if r["experiment_id"] == experiment_id), None)

records = Ledger().read()

record = next((r for r in records if r.get("experiment_id") == experiment_id), None)

gemini-code-assist · 2026-06-10T03:19:55Z

+    artifacts = [
+        a for a in record.get("artifacts", []) if a.get("kind") == "flat_patch"
+    ]


If "artifacts" is explicitly set to null in the JSON ledger, record.get("artifacts", []) will return None, causing a TypeError during iteration. Using record.get("artifacts") or [] ensures it always defaults to an empty list.

Suggested change

artifacts = [

a for a in record.get("artifacts", []) if a.get("kind") == "flat_patch"

]

artifacts = [

a for a in (record.get("artifacts") or []) if a.get("kind") == "flat_patch"

]

gemini-code-assist · 2026-06-10T03:19:55Z

+        commit = (r.get("git", {}).get("commit") or "")[:8]
+        dirty = "*" if r.get("git", {}).get("dirty") else ""


If "git" is explicitly set to null in the JSON ledger, r.get("git", {}) will return None, causing an AttributeError when calling .get("commit"). Using r.get("git") or {} ensures it always defaults to a dictionary.

Suggested change

commit = (r.get("git", {}).get("commit") or "")[:8]

dirty = "*" if r.get("git", {}).get("dirty") else ""

git = r.get("git") or {}

commit = (git.get("commit") or "")[:8]

dirty = "*" if git.get("dirty") else ""

gemini-code-assist · 2026-06-10T03:19:55Z

+    for r in records:
+        commit = (r.get("git", {}).get("commit") or "")[:8]
+        dirty = "*" if r.get("git", {}).get("dirty") else ""
+        m = r.get("metrics", {})


If "metrics" is explicitly set to null in the JSON ledger, r.get("metrics", {}) will return None, causing an AttributeError when calling .get(). Using r.get("metrics") or {} ensures it always defaults to a dictionary.

Suggested change

m = r.get("metrics", {})

m = r.get("metrics") or {}

gemini-code-assist · 2026-06-10T03:19:55Z

+        git = r.get("git", {})
+        lines.append(
+            f"- **commit**: `{git.get('commit')}` ({git.get('branch')})"
+            + ("  ⚠️ dirty tree" if git.get("dirty") else "")
+        )


If "git" is explicitly set to null in the JSON ledger, r.get("git", {}) will return None, causing an AttributeError when calling .get("commit"). Using r.get("git") or {} ensures it always defaults to a dictionary.

Suggested change

git = r.get("git", {})

lines.append(

f"- **commit**: `{git.get('commit')}` ({git.get('branch')})"

+ (" ⚠️ dirty tree" if git.get("dirty") else "")

)

git = r.get("git") or {}

lines.append(

f"- **commit**: `{git.get('commit')}` ({git.get('branch')})"

+ (" ⚠️ dirty tree" if git.get("dirty") else "")

)

gemini-code-assist · 2026-06-10T03:19:55Z

+        env = r.get("environment", {})
+        lines.append(
+            f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], "
+            f"python {env.get('python')}, host {env.get('hostname')}"
+        )


If "environment" is explicitly set to null in the JSON ledger, r.get("environment", {}) will return None, causing an AttributeError when calling .get(). Using r.get("environment") or {} ensures it always defaults to a dictionary.

Suggested change

env = r.get("environment", {})

lines.append(

f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], "

f"python {env.get('python')}, host {env.get('hostname')}"

)

env = r.get("environment") or {}

lines.append(

f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], "

f"python {env.get('python')}, host {env.get('hostname')}"

)

gemini-code-assist · 2026-06-10T03:19:55Z

+        decision = r.get("decision", {})
+        if decision.get("hypothesis"):


If "decision" is explicitly set to null in the JSON ledger, r.get("decision", {}) will return None, causing an AttributeError when calling .get(). Using r.get("decision") or {} ensures it always defaults to a dictionary.

Suggested change

decision = r.get("decision", {})

if decision.get("hypothesis"):

decision = r.get("decision") or {}

if decision.get("hypothesis"):

Tutte flip-free init verified across 4 hemispheres / 2 subjects: equal quality (14.75% vs 14.73% distortion, 121 vs 121 flips) at 37% less runtime. Ablations show the final NAR + spring smoothing are essential (skipping them lowers the distortion number to 13.2% but leaves ~21k flipped triangles — a folded map); only the initial NAR is removable. LSCM init ~ Tutte. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mvdoc · 2026-06-10T03:47:03Z

Autoresearch results (CPU, public Narratives benchmark)

Tutte flip-free init — verified across 4 hemispheres / 2 subjects (sub-022, sub-026):

	mean distortion	total flips	mean runtime
baseline (FS-clone)	14.73%	121	702 s
Tutte init	14.75%	121	442 s (−37%)

Equal quality (distortion +0.03pp, flips identical), ~37% faster. The Tutte map starts flip-free
(0 vs ~141k for the projection), so the initial negative-area-removal phase is unnecessary. Maps are
visually clean on both subjects.

Ablations — the final NAR and spring smoothing are NOT removable (a "numbers lie" result):
skipping the final NAR posts the lowest distortion in the study (13.21%) but leaves ~21000
flipped triangles — a folded, invalid map. Only the initial NAR is removable. LSCM init ≈ Tutte.

Full provenance for all 9 experiments is in the ledger (NOTEBOOK.md / experiments.jsonl under
the data root). Curated summary added in benchmark/FINDINGS.md.

Tests replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers on the metric+area energy, from the Tutte init. Adds optax/jaxopt to the bench extra. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…the lever L-BFGS/CG on the same energy fold (10.31% distortion but 58025 flips, 8x faster); the area-weight sweep is Pareto-worse than baseline; the epoch schedule still folds; a flip barrier keeps injectivity but stalls L-BFGS at the init. The FreeSurfer optimizer's value is the multi-level gradient smoothing + NAR, not the line search. Productive next step is multiscale/hierarchical (multigrid) optimization or SLIM, not a drop-in optimizer swap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mvdoc · 2026-06-10T12:28:22Z

Optimizer-algorithm experiments (answering "can we swap the optimizer?")

Tested replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers (scipy
L-BFGS-B / CG) on the same metric+area energy, from the flip-free Tutte init.

L-BFGS folds: reaches 10.31% distortion in 83 s (~8× faster) but 58025 flipped triangles.
Pareto-worse than baseline: sweeping the area weight can't get low distortion and low flips
(to match ~24 flips it needs ~25–29% distortion vs baseline 15%).
Epoch weight schedule doesn't help: still folds (71269 flips at the distance-dominant stage).
One-sided flip barrier: keeps injectivity (~0 flips) but L-BFGS then stalls at the init
(33.7%) — local distance-reducing moves all push toward folding.

Conclusion: the FreeSurfer optimizer's value is the multi-level gradient smoothing (coherent
coarse-to-fine moves that avoid folding) + NAR, not the line-search algorithm. A drop-in optimizer
swap is a dead end on this energy; the productive directions are multiscale/hierarchical
(geometric-multigrid) optimization or a proper SLIM local-global solver. Details in
benchmark/FINDINGS.md; all runs in the ledger (10 experiments).

Parametrizes the flatmap in the low-frequency cotangent-Laplacian eigenbasis and optimizes coarse-to-fine: band-limited maps can't fold, so the coarse solve is flip-free by construction (19 flips, 66s, 22% distortion vs L-BFGS's 58k flips). Used as a coarse init for full-resolution refinement (geometric multigrid V-cycle). Harmonics cached per mesh by content hash. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Spectral (manifold-harmonic) multigrid: coarse solve is flip-free by band-limiting (19 flips, 66s, 22%) but band-limited; full V-cycle matches quality (15.3%) but isn't faster (eigsh+spectral overhead). Adam (fixed or per-level-scheduled lr) folds or stalls -- a fixed step can't span the multiscale step range. Conclusion: nothing beats the FreeSurfer multiscale line-search on this energy; the line search is the load-bearing piece. Real wins need mesh-decimation multigrid or SLIM, not a drop-in optimizer swap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…othing cap) Profiling shows per-iteration cost is gradient (177ms) + line search (272ms) + smoothing (up to 902ms at n_avg=1024). Expose --n-coarse-steps, --iters-per-level, --max-smoothing to test pure runtime reductions that don't change the energy. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Profiling: per-iteration cost is gradient (177ms) + line search (272ms) + smoothing (up to 902ms at n_avg=1024). Independent, compounding speed levers measured on sub-022 lh: line search 15->7 points (-30%, no quality cost), k-ring neighbors 12->6 (-33%, +0.1pp), fewer iters (-20%), capped smoothing (-7%). Combined fast config: 262s vs 658s original baseline (~60% faster) at +0.31pp distortion. The practical path to speed is config levers + Tutte init, not a new optimizer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Stacked fast config across 4 hemispheres: 194s vs 702s baseline (-72%) at 15.10% vs 14.73% distortion, visually clean maps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…calibration Built a heat-method true-geodesic distortion yardstick; it shows the k-ring metric overstates distortion (~16% true vs 33.7% k-ring for Tutte) and mis-ranks maps. k-ring targets are ~9% inflated locally, BUT recalibrating the 1.207 correction down makes true distortion monotonically worse (18.15% -> 21.66% -> 26.88% for 1.207/1.10/1.00): the factor usefully pre-stretches to compensate curvature-induced medium-range compression. Default is near-optimal; recalibration is a dead end. Real remaining lever: per-edge geodesic targets in the energy (heavier). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…put scale Adds benchmark/truedist.py (energy-independent true-geodesic distortion, banded + global) and benchmark/probe_truedist.py (k_ring / epoch-skip / target-scale sweeps scored on the true metric). FINDINGS section 9: - targets are a confirmed dead end (Dijkstra over-estimates geodesic only ~7.8%, flat across distance; current target=Dijkstra/1.207 is ~11% too compact; 1.207 near-optimal). - k_ring 7->11 does not help (more flips, slower). - the local <=30mm metric is gameable: a conformal Tutte disk wins it but is a degenerate flatmap; the global all-pairs metric correctly ranks the full pipeline best. - validated near-free win: the final scale_to_area normalization is not distance-optimal; rescaling output by ~1.04 cuts global geodesic distortion ~0.5pp (~5% rel), reproduced on held-out geodesic sources (single-hemi; multi-hemi TBD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… term) Read the original Fischl/Sereno/Dale 1999 energy: J = lambda_d J_d + lambda_a J_a, where J_a is gated to negative-area (folded) triangles only -> fold removal, NOT area preservation. Objective is purely metric (distance) distortion; area is a dependent byproduct. scale_to_area is a display convention, not part of the objective. Corrects section 9(d): the ~1.04 scale gain is real on true geodesics but is a symptom of the target-compaction bias, and the k-ring surrogate prefers the opposite scale; not a clean free win. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… on global Adds true_distortion_full (local<=R, global all-pairs, distance-optimal scale) and a build_truegeo runner. Target-scale sweep on sub-022 lh scored on the GLOBAL metric: target_scale=1.10 (effective correction ~1.10, near the measured Dijkstra/geodesic ratio) cuts global true distortion 11.80->11.15% (-0.65pp, ~5.5% rel), flips unchanged. Reverses section 8's local-metric "dead end": on the faithful global metric, calibrating the correction toward true geodesics helps. Cross-hemisphere validation pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…eep) Section 10: re-running the correction sweep scored on the GLOBAL true metric (not the gameable local one of section 8) reverses section 8's "dead end" -- larger targets (eff. correction ~1.10, toward the measured Dijkstra/geodesic ratio ~1.08) reduce global true distortion -- but the win is subject-dependent: clear on sub-022 (both hemis, -0.1 to -0.5pp), neutral/slightly-worse on sub-026 lh. Flips stay comparable. Robust component is the output-scale fix (s*~1.02 everywhere). A larger robust gain needs long-range geodesic anchors in the energy (next step, pending scope decision). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…a robust default Section 11: extended the correction sweep to 8 hemispheres across 7 subjects. Two robust negatives: (1) the local Dijkstra/geodesic ratio is nearly subject-invariant (~1.08) and does NOT predict the optimal correction, so per-subject auto-calibration won't work; (2) no correction change is universally safe -- 5/8 hemis want ~1.10, 3 prefer the original 1.207, and a milder 1.05 doesn't rescue the non-helpers (sub-075 is worse). Mean gain only -0.19pp with per-subject downside up to +0.35pp. Config-lever avenue exhausted; the only consistently safe op is the distance-optimal output scale (small); a larger robust gain needs long-range geodesic anchors in the energy. Adds 5 validation subjects (lh) to the manifest. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-up) Adds a paper-facing executive summary and updates the next-ideas list to reflect the exhausted config-lever search: long-range geodesic anchors remain the one open lever for a larger robust distance-error reduction; distance-optimal output scale is the small safe ship; speed defaults (Tutte init, lean levers) are validated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-better quality Add benchmark/validate_speed.py and FINDINGS section 12. The speed levers were tuned on a single hemisphere (sub-022 lh); this re-validates the stacked fast_ultimate config across all 9 manifest hemispheres, grouped tuned/seen/ held-out, scoring quality with the energy-independent global true-geodesic metric (not the incomparable raw k-ring number). Result: 3.36x mean speedup [3.12-3.60], held-out (3.34x) matches tuned (3.42x) => not overfit. Corrects section 7's "+0.37pp worse" (a k-ring metric artifact): on the faithful metric the fast config is -0.73pp BETTER on every scored hemisphere. 10 ledger records logged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The per-vertex angular-sampling loop in compute_kring_geodesic_distances_angular was serial Python over all vertices (~4 min/hemi, the one-time cache-build cost), while 31 of 32 cores sat idle. Fold tangent-plane projection + per-ring angular sampling + limited Dijkstra into a single @njit(parallel=True) prange kernel (_angular_kring_kernel) plus an njit port of the angular sampler (_select_angular_samples_njit). Output is bit-identical to the previous numba production path: verified against the existing sub-022 lh cache (0/193174 neighbor-set mismatches, 0 distance mismatches), and pinned by a new regression test (TestAngularKernelParity). Steady-state runtime for sub-022 lh drops from ~230s to 8.7s (~26x); ~17s including cold JIT. The serial loop is kept as the use_numba=False fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… cold) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reimplement the projection stage's only FreeSurfer dependency (mri_label2label --regmethod surface) as a pure-Python union(push, pull) KDTree mapping on sphere.reg. Validated three ways: - vs real mri_label2label (FS 6.0): 0 vertex-ID differences across 6 cuts x 3 hemispheres (validate_mapping_vs_freesurfer.py) - end-to-end patch bit-identical on 9/9 manifest hemispheres incl. all 5 held-out subjects (validate_projection.py) - unit invariants, no FreeSurfer (tests/test_projection.py) Mapping step: 36.6s (FreeSurfer subprocess) -> ~0.3s (KDTree). Projection no longer requires FreeSurfer. Documented in FINDINGS section 14; ledger records logged. Phase 2 (cut-placement quality, scored by downstream flatmap distortion) is next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add benchmark/probe_refinement.py to test whether projection refinement (continuity + geodesic) improves downstream flatmap distortion, scored by the fast flattener + global true-geodesic metric (fresh k-ring + truegeo per variant). Add a `continuity` toggle to project_python for the ablation. First signal (sub-022 lh): geodesic refinement gives HIGHER global distortion (11.37%) than raw mapped cuts (10.69%); all variants are topologically valid. N=1, <1pp spread — broadening before concluding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ontinuity Ablation across 6 hemispheres (probe_refinement.py), scored by downstream fast-flatten + global true-geodesic distortion: geodesic (shipped) 11.12% worst/tied, inflates flips continuity_only 10.84% BEST (-0.28pp), fewer flips mapped_only 11.12% continuity is what helps, not thinning geodesic_curv (c) 11.24% worst; flips exploded (424 on sub-022 rh) The geodesic refinement does not help and slightly hurts: connecting cut components (continuity) is the win; thinning the thick mapped cut to a 1-vertex geodesic is the harm (removes strain relief, jagged boundary). Curvature-weighted routing (c) failed because sulci meander -> more boundary tortuosity -> more flips, confirming thinning (not the routing metric) is the core problem. Adds a `continuity` toggle and `refine_weight=curvature` (curv_alpha) path to project_python. Shippable recommendation: default projection to geodesic-refinement-off. Documented in FINDINGS section 15. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Tested the boundary-smoothing follow-on: keep the thick continuity_only cut but morphologically close its boundary (project_python morph_close=n; _morphological_close_cuts). Result across 6 hemispheres: continuity_only 10.84% BEST (robust winner) thick_close2 10.99% reduces worst flips (59->33) but distortion mixed thick_close1 11.01% same; hurts sub-026 lh by +1pp Both "smarter" refinements (curvature routing (c), boundary smoothing (2)) fail to beat plain continuity_only. The simplest answer wins: connect the cut components, don't thin, don't smooth. Coverage cost of closing is negligible (<150 verts, <0.1% of surface; medial wall ~11k dominates). Adds --save-flat to the probe for visualization. FINDINGS section 15 updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… O(n^2)) The non-angular k-ring builder allocated three O(n_vertices) arrays inside a per-vertex prange and collected results with an O(n_vertices) scan per vertex. numba's parallel analysis refuses to parallelize a loop with per-iteration allocations, so a fresh compile ran serially -> O(n^2), ~16 min on a 193k-vertex mesh (a cached build had masked this). Rewrite to parallelize over chunks: split vertices into n_chunks blocks, prange over chunks, each owns its scratch (allocated once) and resets only touched entries; collect each ring from the touched list and sort to match the reference output. Identical output (matches pure-Python get_k_ring), now ~12s and multi-core on 193k verts. Used by plotting and the uncached non-angular flatten path. 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

compute_kring_geodesic_distances ran the per-vertex limited Dijkstra in a serial Python list comprehension (~16s on 193k verts). Add a chunk-parallel kernel mirroring _limited_dijkstra_numba with per-chunk scratch (allocated once, only touched entries reset), producing distances in the flat k-ring layout. Bit-exact vs the serial path (max diff 0.0 over 3000 sampled verts), multi-core. Exposes get_k_ring_fast_flat to share the flat ring layout. End-to-end flatmap plot: 43s -> 27s. Helps the uncached non-angular flatten path too. 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the 193k-iteration Python loop with a flattened numpy computation (concatenate neighbor lists, segment-sum via bincount). Per-vertex distortion == 100 * sum_abs / sum_target since the neighbor count cancels. Matches the old loop (max per-vertex diff 8.5e-14). The loop is now ~0.2s. With the parallel k-ring (ring builder + distances), end-to-end flatmap plot is ~22.6s (was 43s). 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The flatmap distortion overlay now (by default in plot_flatmap): - rescales by the distance-optimal global scale s* before scoring, so a global scale offset no longer inflates the per-vertex map (shows true local shape distortion; s* is reported in the panel title); - shows SIGNED relative distortion with a diverging colormap (RdBu_r): + = flatmap stretched, - = compressed, instead of magnitude-only. compute_kring_distortion gains optimal_scale/signed/return_opt_scale flags; defaults preserve the original magnitude/raw-scale behavior (viz tests and the scaling-sensitivity test unchanged). 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…l (default on) The flattener's final scale was area-matched (s = sqrt(orig_area/total_area)), a display convention that leaves the map ~6% too small vs true geodesic distances (validated: opt scale 1.063 +/- 0.008 across 6 hemispheres, expand). Add distance_optimal_scale: after optimization, rescale the output by the single global scale that minimizes true-geodesic distortion, computed from a deterministic heat-method geodesic sample on the fiducial patch. So the surface and the flat map are directly comparable (the use case: viewing inflated + flattened together). New DistanceOptimalScaleConfig (enabled by default, n_sources=200, seed=0); distance_optimal_scale() in distance.py; applied in SurfaceFlattener.run() before final stats. Deterministic (fixed seed; heat method is deterministic). End-to-end: s=1.055 on sub-022 lh, true distortion 13.87% -> 13.23% (-0.64pp). viz.plot_flatmap now scores at the true (already-faithful) output scale instead of re-normalizing by the gameable local k-ring optimum. 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ated) Document the shipped distance-optimal output scale: true-geodesic optimum 1.063 +/- 0.008 (expand) across 6 hemispheres, lowers global distortion on every one; replaces area-matching as the default. Note the sign correction (local k-ring optimum points the wrong way, ~0.95 shrink, due to target compaction). Update executive summary + Next-ideas (marked shipped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mvdoc and others added 6 commits June 9, 2026 18:32

Add unit tests for Tutte flip-free init and area scaling

b9da0b7

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add general autoresearch experiment runner

9d5982c

Parametrized runner (init method + phase toggles + k-ring) that logs each variant to the ledger with a decision trace, for fanning out optimization experiments. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Document experiment.py runner and plot.py in benchmark README

0c9428c

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 10, 2026

View reviewed changes

mvdoc and others added 2 commits June 10, 2026 05:21

Add optimizer-swap probe (L-BFGS/CG/Adam on the same energy)

9a93b44

Tests replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers on the metric+area energy, from the Tutte init. Adds optax/jaxopt to the bench extra. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mvdoc and others added 18 commits June 10, 2026 05:45

Findings: validated 3.6x speedup (fast_ultimate, n=4)

025b302

Stacked fast config across 4 hemispheres: 194s vs 702s baseline (-72%) at 15.10% vs 14.73% distortion, visually clean maps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Findings: record k-ring parallelization (~26x, bit-identical)

e3ad610

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Findings: authoritative end-to-end timing (4 hemis, ~205s/hemi, ~4.3x…

df3d7e5

… cold) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mvdoc marked this pull request as draft June 11, 2026 17:53

mvdoc and others added 7 commits June 11, 2026 12:25

		base = Ledger().latest(kind="baseline")
		base_m = base["metrics"] if base else None

		flattener.config.negative_area_removal.enabled = False
		flattener.initial_projection = lambda: init # injected into run()

		records = Ledger().read()
		record = next((r for r in records if r["experiment_id"] == experiment_id), None)

		commit = (r.get("git", {}).get("commit") or "")[:8]
		dirty = "*" if r.get("git", {}).get("dirty") else ""

-        commit = (r.get("git", {}).get("commit") or "")[:8]
-        dirty = "*" if r.get("git", {}).get("dirty") else ""
+        git = r.get("git") or {}
+        commit = (git.get("commit") or "")[:8]
+        dirty = "*" if git.get("dirty") else ""

		decision = r.get("decision", {})
		if decision.get("hypothesis"):

Conversation

mvdoc commented Jun 10, 2026

What this adds

Components

Output policy

Status / verification

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

mvdoc commented Jun 10, 2026

Autoresearch results (CPU, public Narratives benchmark)

Uh oh!

mvdoc commented Jun 10, 2026

Optimizer-algorithm experiments (answering "can we swap the optimizer?")

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant