Benchmark + autoresearch harness for the flattening stage#68
Conversation
Provenance-tracked, CPU-only benchmark for optimizing the pyflatten flattening stage, on the public Narratives dataset. Foundation pieces: - paths/ledger/metrics/harness: method-agnostic evaluate(flatten_fn, ...) with uniform per-patch metrics (mean/p90 % distance error, flipped triangles, area distortion) and an append-only JSONL provenance ledger (commit SHA, env, seeds, metrics, artifacts, repro command, decision trace). - build_dataset/run_baseline/report: build the manifest from cached Narratives patches, run the FreeSurfer-clone baseline (+ determinism check), render NOTEBOOK.md. - Project recovery skill (.claude/skills/autoflatten-autoresearch) so any session can recover state from the ledger and resume the loop. - PLAN.md (approved design) + README; bench extra in pyproject; unit tests. Code lives in the repo; all generated artifacts go to /data2/projects/autoflatten. See benchmark/PLAN.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Injects a guaranteed-injective Tutte embedding (libigl harmonic map, boundary pinned to a circle) as the initial map and disables the initial negative-area- removal phase, feeding the existing geodesic-stress refinement. Verified flip-free: Tutte init gives 0/385122 flipped triangles at 33.7% distortion, vs the FreeSurfer projection's 141690 flips at 172%. Logs a probe record with hypothesis + head-to-head conclusion vs the latest baseline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
benchmark/plot.py renders the saved flat patches of a ledger experiment (mesh + flipped triangles in red + boundary), pulling metrics from the ledger instead of recomputing distortion (plot_flatmap's per-vertex distortion recompute is the bottleneck: ~5s vs ~20min on a 193k-vertex hemisphere). --full keeps the 3-panel plot. Visual check on sub-022 lh: baseline and Tutte-init maps are both clean single-blob flatmaps with no tangling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Parametrized runner (init method + phase toggles + k-ring) that logs each variant to the ledger with a decision trace, for fanning out optimization experiments. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive, reproducible, and provenance-tracked benchmark and autoresearch harness for the pyflatten stage of AutoFlatten. It includes dataset building from the public Narratives dataset, a method-agnostic evaluation harness, an append-only JSONL ledger for tracking experiments, quality metrics, plotting utilities, and a Tutte/LSCM flip-free initialization probe. The reviewer's feedback is highly constructive, pointing out several robustness improvements when parsing the JSON ledger (such as using .get() and handling explicit null values to avoid KeyError and TypeError exceptions) and highlighting a potential side-effect where mutating flattener.config in-place could unintentionally affect shared configuration objects across evaluations.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| base = Ledger().latest(kind="baseline") | ||
| base_m = base["metrics"] if base else None |
There was a problem hiding this comment.
To prevent a potential KeyError if the baseline record is missing the "metrics" key, use .get("metrics") instead of direct bracket access.
| base = Ledger().latest(kind="baseline") | |
| base_m = base["metrics"] if base else None | |
| base = Ledger().latest(kind="baseline") | |
| base_m = base.get("metrics") if base else None |
| flattener.config.negative_area_removal.enabled = False | ||
| flattener.initial_projection = lambda: init # injected into run() |
There was a problem hiding this comment.
Mutating flattener.config directly modifies the shared config object passed into evaluate(). If the same config instance is reused across multiple evaluations or baseline runs, this side-effect will unexpectedly disable negative area removal for those runs as well. Consider copying the config or avoiding in-place mutation of shared arguments.
| flattener.config.negative_area_removal.enabled = False | |
| flattener.initial_projection = lambda: init # injected into run() | |
| import copy | |
| flattener.config = copy.deepcopy(flattener.config) | |
| flattener.config.negative_area_removal.enabled = False | |
| flattener.initial_projection = lambda: init # injected into run() |
| def _baseline_reference(): | ||
| """Most recent baseline record's aggregate metrics, for head-to-head printing.""" | ||
| rec = Ledger().latest(kind="baseline") | ||
| return rec["metrics"] if rec else None |
There was a problem hiding this comment.
To prevent a potential KeyError if the baseline record is missing the "metrics" key, use .get("metrics") instead of direct bracket access.
| def _baseline_reference(): | |
| """Most recent baseline record's aggregate metrics, for head-to-head printing.""" | |
| rec = Ledger().latest(kind="baseline") | |
| return rec["metrics"] if rec else None | |
| def _baseline_reference(): | |
| """Most recent baseline record's aggregate metrics, for head-to-head printing.""" | |
| rec = Ledger().latest(kind="baseline") | |
| return rec.get("metrics") if rec else None |
| records = Ledger().read() | ||
| record = next((r for r in records if r["experiment_id"] == experiment_id), None) |
There was a problem hiding this comment.
Use .get("experiment_id") instead of r["experiment_id"] to avoid raising a KeyError if any record in the ledger is missing the "experiment_id" key.
| records = Ledger().read() | |
| record = next((r for r in records if r["experiment_id"] == experiment_id), None) | |
| records = Ledger().read() | |
| record = next((r for r in records if r.get("experiment_id") == experiment_id), None) |
| artifacts = [ | ||
| a for a in record.get("artifacts", []) if a.get("kind") == "flat_patch" | ||
| ] |
There was a problem hiding this comment.
If "artifacts" is explicitly set to null in the JSON ledger, record.get("artifacts", []) will return None, causing a TypeError during iteration. Using record.get("artifacts") or [] ensures it always defaults to an empty list.
| artifacts = [ | |
| a for a in record.get("artifacts", []) if a.get("kind") == "flat_patch" | |
| ] | |
| artifacts = [ | |
| a for a in (record.get("artifacts") or []) if a.get("kind") == "flat_patch" | |
| ] |
| commit = (r.get("git", {}).get("commit") or "")[:8] | ||
| dirty = "*" if r.get("git", {}).get("dirty") else "" |
There was a problem hiding this comment.
If "git" is explicitly set to null in the JSON ledger, r.get("git", {}) will return None, causing an AttributeError when calling .get("commit"). Using r.get("git") or {} ensures it always defaults to a dictionary.
| commit = (r.get("git", {}).get("commit") or "")[:8] | |
| dirty = "*" if r.get("git", {}).get("dirty") else "" | |
| git = r.get("git") or {} | |
| commit = (git.get("commit") or "")[:8] | |
| dirty = "*" if git.get("dirty") else "" |
| for r in records: | ||
| commit = (r.get("git", {}).get("commit") or "")[:8] | ||
| dirty = "*" if r.get("git", {}).get("dirty") else "" | ||
| m = r.get("metrics", {}) |
There was a problem hiding this comment.
| git = r.get("git", {}) | ||
| lines.append( | ||
| f"- **commit**: `{git.get('commit')}` ({git.get('branch')})" | ||
| + (" ⚠️ dirty tree" if git.get("dirty") else "") | ||
| ) |
There was a problem hiding this comment.
If "git" is explicitly set to null in the JSON ledger, r.get("git", {}) will return None, causing an AttributeError when calling .get("commit"). Using r.get("git") or {} ensures it always defaults to a dictionary.
| git = r.get("git", {}) | |
| lines.append( | |
| f"- **commit**: `{git.get('commit')}` ({git.get('branch')})" | |
| + (" ⚠️ dirty tree" if git.get("dirty") else "") | |
| ) | |
| git = r.get("git") or {} | |
| lines.append( | |
| f"- **commit**: `{git.get('commit')}` ({git.get('branch')})" | |
| + (" ⚠️ dirty tree" if git.get("dirty") else "") | |
| ) |
| env = r.get("environment", {}) | ||
| lines.append( | ||
| f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], " | ||
| f"python {env.get('python')}, host {env.get('hostname')}" | ||
| ) |
There was a problem hiding this comment.
If "environment" is explicitly set to null in the JSON ledger, r.get("environment", {}) will return None, causing an AttributeError when calling .get(). Using r.get("environment") or {} ensures it always defaults to a dictionary.
| env = r.get("environment", {}) | |
| lines.append( | |
| f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], " | |
| f"python {env.get('python')}, host {env.get('hostname')}" | |
| ) | |
| env = r.get("environment") or {} | |
| lines.append( | |
| f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], " | |
| f"python {env.get('python')}, host {env.get('hostname')}" | |
| ) |
| decision = r.get("decision", {}) | ||
| if decision.get("hypothesis"): |
There was a problem hiding this comment.
If "decision" is explicitly set to null in the JSON ledger, r.get("decision", {}) will return None, causing an AttributeError when calling .get(). Using r.get("decision") or {} ensures it always defaults to a dictionary.
| decision = r.get("decision", {}) | |
| if decision.get("hypothesis"): | |
| decision = r.get("decision") or {} | |
| if decision.get("hypothesis"): |
Tutte flip-free init verified across 4 hemispheres / 2 subjects: equal quality (14.75% vs 14.73% distortion, 121 vs 121 flips) at 37% less runtime. Ablations show the final NAR + spring smoothing are essential (skipping them lowers the distortion number to 13.2% but leaves ~21k flipped triangles — a folded map); only the initial NAR is removable. LSCM init ~ Tutte. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Autoresearch results (CPU, public Narratives benchmark)Tutte flip-free init — verified across 4 hemispheres / 2 subjects (sub-022, sub-026):
Equal quality (distortion +0.03pp, flips identical), ~37% faster. The Tutte map starts flip-free Ablations — the final NAR and spring smoothing are NOT removable (a "numbers lie" result): Full provenance for all 9 experiments is in the ledger ( |
Tests replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers on the metric+area energy, from the Tutte init. Adds optax/jaxopt to the bench extra. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…the lever L-BFGS/CG on the same energy fold (10.31% distortion but 58025 flips, 8x faster); the area-weight sweep is Pareto-worse than baseline; the epoch schedule still folds; a flip barrier keeps injectivity but stalls L-BFGS at the init. The FreeSurfer optimizer's value is the multi-level gradient smoothing + NAR, not the line search. Productive next step is multiscale/hierarchical (multigrid) optimization or SLIM, not a drop-in optimizer swap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Optimizer-algorithm experiments (answering "can we swap the optimizer?")Tested replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers (scipy
Conclusion: the FreeSurfer optimizer's value is the multi-level gradient smoothing (coherent |
Parametrizes the flatmap in the low-frequency cotangent-Laplacian eigenbasis and optimizes coarse-to-fine: band-limited maps can't fold, so the coarse solve is flip-free by construction (19 flips, 66s, 22% distortion vs L-BFGS's 58k flips). Used as a coarse init for full-resolution refinement (geometric multigrid V-cycle). Harmonics cached per mesh by content hash. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Spectral (manifold-harmonic) multigrid: coarse solve is flip-free by band-limiting (19 flips, 66s, 22%) but band-limited; full V-cycle matches quality (15.3%) but isn't faster (eigsh+spectral overhead). Adam (fixed or per-level-scheduled lr) folds or stalls -- a fixed step can't span the multiscale step range. Conclusion: nothing beats the FreeSurfer multiscale line-search on this energy; the line search is the load-bearing piece. Real wins need mesh-decimation multigrid or SLIM, not a drop-in optimizer swap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…othing cap) Profiling shows per-iteration cost is gradient (177ms) + line search (272ms) + smoothing (up to 902ms at n_avg=1024). Expose --n-coarse-steps, --iters-per-level, --max-smoothing to test pure runtime reductions that don't change the energy. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Profiling: per-iteration cost is gradient (177ms) + line search (272ms) + smoothing (up to 902ms at n_avg=1024). Independent, compounding speed levers measured on sub-022 lh: line search 15->7 points (-30%, no quality cost), k-ring neighbors 12->6 (-33%, +0.1pp), fewer iters (-20%), capped smoothing (-7%). Combined fast config: 262s vs 658s original baseline (~60% faster) at +0.31pp distortion. The practical path to speed is config levers + Tutte init, not a new optimizer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stacked fast config across 4 hemispheres: 194s vs 702s baseline (-72%) at 15.10% vs 14.73% distortion, visually clean maps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…calibration Built a heat-method true-geodesic distortion yardstick; it shows the k-ring metric overstates distortion (~16% true vs 33.7% k-ring for Tutte) and mis-ranks maps. k-ring targets are ~9% inflated locally, BUT recalibrating the 1.207 correction down makes true distortion monotonically worse (18.15% -> 21.66% -> 26.88% for 1.207/1.10/1.00): the factor usefully pre-stretches to compensate curvature-induced medium-range compression. Default is near-optimal; recalibration is a dead end. Real remaining lever: per-edge geodesic targets in the energy (heavier). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…put scale Adds benchmark/truedist.py (energy-independent true-geodesic distortion, banded + global) and benchmark/probe_truedist.py (k_ring / epoch-skip / target-scale sweeps scored on the true metric). FINDINGS section 9: - targets are a confirmed dead end (Dijkstra over-estimates geodesic only ~7.8%, flat across distance; current target=Dijkstra/1.207 is ~11% too compact; 1.207 near-optimal). - k_ring 7->11 does not help (more flips, slower). - the local <=30mm metric is gameable: a conformal Tutte disk wins it but is a degenerate flatmap; the global all-pairs metric correctly ranks the full pipeline best. - validated near-free win: the final scale_to_area normalization is not distance-optimal; rescaling output by ~1.04 cuts global geodesic distortion ~0.5pp (~5% rel), reproduced on held-out geodesic sources (single-hemi; multi-hemi TBD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… term) Read the original Fischl/Sereno/Dale 1999 energy: J = lambda_d J_d + lambda_a J_a, where J_a is gated to negative-area (folded) triangles only -> fold removal, NOT area preservation. Objective is purely metric (distance) distortion; area is a dependent byproduct. scale_to_area is a display convention, not part of the objective. Corrects section 9(d): the ~1.04 scale gain is real on true geodesics but is a symptom of the target-compaction bias, and the k-ring surrogate prefers the opposite scale; not a clean free win. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… on global Adds true_distortion_full (local<=R, global all-pairs, distance-optimal scale) and a build_truegeo runner. Target-scale sweep on sub-022 lh scored on the GLOBAL metric: target_scale=1.10 (effective correction ~1.10, near the measured Dijkstra/geodesic ratio) cuts global true distortion 11.80->11.15% (-0.65pp, ~5.5% rel), flips unchanged. Reverses section 8's local-metric "dead end": on the faithful global metric, calibrating the correction toward true geodesics helps. Cross-hemisphere validation pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eep) Section 10: re-running the correction sweep scored on the GLOBAL true metric (not the gameable local one of section 8) reverses section 8's "dead end" -- larger targets (eff. correction ~1.10, toward the measured Dijkstra/geodesic ratio ~1.08) reduce global true distortion -- but the win is subject-dependent: clear on sub-022 (both hemis, -0.1 to -0.5pp), neutral/slightly-worse on sub-026 lh. Flips stay comparable. Robust component is the output-scale fix (s*~1.02 everywhere). A larger robust gain needs long-range geodesic anchors in the energy (next step, pending scope decision). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…a robust default Section 11: extended the correction sweep to 8 hemispheres across 7 subjects. Two robust negatives: (1) the local Dijkstra/geodesic ratio is nearly subject-invariant (~1.08) and does NOT predict the optimal correction, so per-subject auto-calibration won't work; (2) no correction change is universally safe -- 5/8 hemis want ~1.10, 3 prefer the original 1.207, and a milder 1.05 doesn't rescue the non-helpers (sub-075 is worse). Mean gain only -0.19pp with per-subject downside up to +0.35pp. Config-lever avenue exhausted; the only consistently safe op is the distance-optimal output scale (small); a larger robust gain needs long-range geodesic anchors in the energy. Adds 5 validation subjects (lh) to the manifest. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-up) Adds a paper-facing executive summary and updates the next-ideas list to reflect the exhausted config-lever search: long-range geodesic anchors remain the one open lever for a larger robust distance-error reduction; distance-optimal output scale is the small safe ship; speed defaults (Tutte init, lean levers) are validated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-better quality Add benchmark/validate_speed.py and FINDINGS section 12. The speed levers were tuned on a single hemisphere (sub-022 lh); this re-validates the stacked fast_ultimate config across all 9 manifest hemispheres, grouped tuned/seen/ held-out, scoring quality with the energy-independent global true-geodesic metric (not the incomparable raw k-ring number). Result: 3.36x mean speedup [3.12-3.60], held-out (3.34x) matches tuned (3.42x) => not overfit. Corrects section 7's "+0.37pp worse" (a k-ring metric artifact): on the faithful metric the fast config is -0.73pp BETTER on every scored hemisphere. 10 ledger records logged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-vertex angular-sampling loop in compute_kring_geodesic_distances_angular was serial Python over all vertices (~4 min/hemi, the one-time cache-build cost), while 31 of 32 cores sat idle. Fold tangent-plane projection + per-ring angular sampling + limited Dijkstra into a single @njit(parallel=True) prange kernel (_angular_kring_kernel) plus an njit port of the angular sampler (_select_angular_samples_njit). Output is bit-identical to the previous numba production path: verified against the existing sub-022 lh cache (0/193174 neighbor-set mismatches, 0 distance mismatches), and pinned by a new regression test (TestAngularKernelParity). Steady-state runtime for sub-022 lh drops from ~230s to 8.7s (~26x); ~17s including cold JIT. The serial loop is kept as the use_numba=False fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… cold) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reimplement the projection stage's only FreeSurfer dependency (mri_label2label --regmethod surface) as a pure-Python union(push, pull) KDTree mapping on sphere.reg. Validated three ways: - vs real mri_label2label (FS 6.0): 0 vertex-ID differences across 6 cuts x 3 hemispheres (validate_mapping_vs_freesurfer.py) - end-to-end patch bit-identical on 9/9 manifest hemispheres incl. all 5 held-out subjects (validate_projection.py) - unit invariants, no FreeSurfer (tests/test_projection.py) Mapping step: 36.6s (FreeSurfer subprocess) -> ~0.3s (KDTree). Projection no longer requires FreeSurfer. Documented in FINDINGS section 14; ledger records logged. Phase 2 (cut-placement quality, scored by downstream flatmap distortion) is next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add benchmark/probe_refinement.py to test whether projection refinement (continuity + geodesic) improves downstream flatmap distortion, scored by the fast flattener + global true-geodesic metric (fresh k-ring + truegeo per variant). Add a `continuity` toggle to project_python for the ablation. First signal (sub-022 lh): geodesic refinement gives HIGHER global distortion (11.37%) than raw mapped cuts (10.69%); all variants are topologically valid. N=1, <1pp spread — broadening before concluding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ontinuity Ablation across 6 hemispheres (probe_refinement.py), scored by downstream fast-flatten + global true-geodesic distortion: geodesic (shipped) 11.12% worst/tied, inflates flips continuity_only 10.84% BEST (-0.28pp), fewer flips mapped_only 11.12% continuity is what helps, not thinning geodesic_curv (c) 11.24% worst; flips exploded (424 on sub-022 rh) The geodesic refinement does not help and slightly hurts: connecting cut components (continuity) is the win; thinning the thick mapped cut to a 1-vertex geodesic is the harm (removes strain relief, jagged boundary). Curvature-weighted routing (c) failed because sulci meander -> more boundary tortuosity -> more flips, confirming thinning (not the routing metric) is the core problem. Adds a `continuity` toggle and `refine_weight=curvature` (curv_alpha) path to project_python. Shippable recommendation: default projection to geodesic-refinement-off. Documented in FINDINGS section 15. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tested the boundary-smoothing follow-on: keep the thick continuity_only cut but morphologically close its boundary (project_python morph_close=n; _morphological_close_cuts). Result across 6 hemispheres: continuity_only 10.84% BEST (robust winner) thick_close2 10.99% reduces worst flips (59->33) but distortion mixed thick_close1 11.01% same; hurts sub-026 lh by +1pp Both "smarter" refinements (curvature routing (c), boundary smoothing (2)) fail to beat plain continuity_only. The simplest answer wins: connect the cut components, don't thin, don't smooth. Coverage cost of closing is negligible (<150 verts, <0.1% of surface; medial wall ~11k dominates). Adds --save-flat to the probe for visualization. FINDINGS section 15 updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… O(n^2)) The non-angular k-ring builder allocated three O(n_vertices) arrays inside a per-vertex prange and collected results with an O(n_vertices) scan per vertex. numba's parallel analysis refuses to parallelize a loop with per-iteration allocations, so a fresh compile ran serially -> O(n^2), ~16 min on a 193k-vertex mesh (a cached build had masked this). Rewrite to parallelize over chunks: split vertices into n_chunks blocks, prange over chunks, each owns its scratch (allocated once) and resets only touched entries; collect each ring from the touched list and sort to match the reference output. Identical output (matches pure-Python get_k_ring), now ~12s and multi-core on 193k verts. Used by plotting and the uncached non-angular flatten path. 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
compute_kring_geodesic_distances ran the per-vertex limited Dijkstra in a serial Python list comprehension (~16s on 193k verts). Add a chunk-parallel kernel mirroring _limited_dijkstra_numba with per-chunk scratch (allocated once, only touched entries reset), producing distances in the flat k-ring layout. Bit-exact vs the serial path (max diff 0.0 over 3000 sampled verts), multi-core. Exposes get_k_ring_fast_flat to share the flat ring layout. End-to-end flatmap plot: 43s -> 27s. Helps the uncached non-angular flatten path too. 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the 193k-iteration Python loop with a flattened numpy computation (concatenate neighbor lists, segment-sum via bincount). Per-vertex distortion == 100 * sum_abs / sum_target since the neighbor count cancels. Matches the old loop (max per-vertex diff 8.5e-14). The loop is now ~0.2s. With the parallel k-ring (ring builder + distances), end-to-end flatmap plot is ~22.6s (was 43s). 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The flatmap distortion overlay now (by default in plot_flatmap): - rescales by the distance-optimal global scale s* before scoring, so a global scale offset no longer inflates the per-vertex map (shows true local shape distortion; s* is reported in the panel title); - shows SIGNED relative distortion with a diverging colormap (RdBu_r): + = flatmap stretched, - = compressed, instead of magnitude-only. compute_kring_distortion gains optimal_scale/signed/return_opt_scale flags; defaults preserve the original magnitude/raw-scale behavior (viz tests and the scaling-sensitivity test unchanged). 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l (default on) The flattener's final scale was area-matched (s = sqrt(orig_area/total_area)), a display convention that leaves the map ~6% too small vs true geodesic distances (validated: opt scale 1.063 +/- 0.008 across 6 hemispheres, expand). Add distance_optimal_scale: after optimization, rescale the output by the single global scale that minimizes true-geodesic distortion, computed from a deterministic heat-method geodesic sample on the fiducial patch. So the surface and the flat map are directly comparable (the use case: viewing inflated + flattened together). New DistanceOptimalScaleConfig (enabled by default, n_sources=200, seed=0); distance_optimal_scale() in distance.py; applied in SurfaceFlattener.run() before final stats. Deterministic (fixed seed; heat method is deterministic). End-to-end: s=1.055 on sub-022 lh, true distortion 13.87% -> 13.23% (-0.64pp). viz.plot_flatmap now scores at the true (already-faithful) output scale instead of re-normalizing by the gameable local k-ring optimum. 313 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ated) Document the shipped distance-optimal output scale: true-geodesic optimum 1.063 +/- 0.008 (expand) across 6 hemispheres, lowers global distortion on every one; replaces area-matching as the default. Note the sign correction (local k-ring optimum points the wrong way, ~0.95 shrink, due to target compaction). Update executive summary + Next-ideas (marked shipped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What this adds
A reproducible, provenance-tracked benchmark + autoresearch harness for optimizing the
pyflatten flattening stage, plus the first optimization result. All code lives under
benchmark/(the installedautoflattenpackage is untouched); seebenchmark/PLAN.mdforthe full approved design and
.claude/skills/autoflatten-autoresearch/for session recovery.Headline result: replacing the FreeSurfer-style projection + initial negative-area-removal
init with a guaranteed flip-free Tutte embedding reaches the same flattening quality at
~31% less runtime on the first hemisphere (
sub-022 lh: 15.25% vs 15.20% mean distortion,24 vs 25 flipped triangles, 454 s vs 658 s). The Tutte init starts at 0/385122 flipped
triangles (Tutte's theorem) vs the projection's 141690, so the ~4-min initial NAR phase is
unnecessary. Broader (n=4) validation and phase-removal experiments are running.
Components
paths.py/ledger.py— output-location policy + append-only JSONL provenance ledger(commit SHA, env, seeds, metrics, artifacts, repro command, decision trace).
metrics.py/harness.py— uniform per-patch metrics + method-agnosticevaluate(flatten_fn, ...)with on-disk k-ring caching.build_dataset.py— manifest from the public Narratives dataset (OpenNeuro ds002345),reusing cached projection patches (no FreeSurfer, no fetching).
run_baseline.py— FreeSurfer-clone baseline + determinism check (verified bit-identical).probe_tutte_init.py— the flip-free (Tutte/LSCM) init probe.experiment.py— general autoresearch runner (init method + refinement-phase toggles).plot.py— fast flatmap renderer for visual verification.report.py— renders the ledger into a human-readableNOTEBOOK.md.benchmark/tests/),benchextra inpyproject.toml.Output policy
Code + markdown docs live in the repo; all loop-generated artifacts (ledger, manifest,
flat patches, caches, figures,
NOTEBOOK.md) are written to/data2/projects/autoflatten/,not the repo. Each ledger record links its artifacts back to the producing commit SHA.
Status / verification
🤖 Generated with Claude Code