Skip to content

Benchmark + autoresearch harness for the flattening stage#68

Draft
mvdoc wants to merge 35 commits into
devfrom
benchmark-autoresearch
Draft

Benchmark + autoresearch harness for the flattening stage#68
mvdoc wants to merge 35 commits into
devfrom
benchmark-autoresearch

Conversation

@mvdoc

@mvdoc mvdoc commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What this adds

A reproducible, provenance-tracked benchmark + autoresearch harness for optimizing the
pyflatten flattening stage, plus the first optimization result. All code lives under
benchmark/ (the installed autoflatten package is untouched); see benchmark/PLAN.md for
the full approved design and .claude/skills/autoflatten-autoresearch/ for session recovery.

Headline result: replacing the FreeSurfer-style projection + initial negative-area-removal
init with a guaranteed flip-free Tutte embedding reaches the same flattening quality at
~31% less runtime on the first hemisphere (sub-022 lh: 15.25% vs 15.20% mean distortion,
24 vs 25 flipped triangles, 454 s vs 658 s). The Tutte init starts at 0/385122 flipped
triangles
(Tutte's theorem) vs the projection's 141690, so the ~4-min initial NAR phase is
unnecessary. Broader (n=4) validation and phase-removal experiments are running.

Components

  • paths.py / ledger.py — output-location policy + append-only JSONL provenance ledger
    (commit SHA, env, seeds, metrics, artifacts, repro command, decision trace).
  • metrics.py / harness.py — uniform per-patch metrics + method-agnostic
    evaluate(flatten_fn, ...) with on-disk k-ring caching.
  • build_dataset.py — manifest from the public Narratives dataset (OpenNeuro ds002345),
    reusing cached projection patches (no FreeSurfer, no fetching).
  • run_baseline.py — FreeSurfer-clone baseline + determinism check (verified bit-identical).
  • probe_tutte_init.py — the flip-free (Tutte/LSCM) init probe.
  • experiment.py — general autoresearch runner (init method + refinement-phase toggles).
  • plot.py — fast flatmap renderer for visual verification.
  • report.py — renders the ledger into a human-readable NOTEBOOK.md.
  • Unit tests (benchmark/tests/), bench extra in pyproject.toml.

Output policy

Code + markdown docs live in the repo; all loop-generated artifacts (ledger, manifest,
flat patches, caches, figures, NOTEBOOK.md) are written to /data2/projects/autoflatten/,
not the repo. Each ledger record links its artifacts back to the producing commit SHA.

Status / verification

  • Phase A (harness + baseline + ledger) verified end-to-end on CPU; determinism confirmed.
  • Tutte probe verified flip-free and visually sound (clean single-blob flatmaps, no tangling).
  • 12 unit tests pass; ruff clean.
  • Compute is CPU-only (the box's GPUs are blocked by an old driver, 440/CUDA 10.2).

🤖 Generated with Claude Code

mvdoc and others added 6 commits June 9, 2026 18:32
Provenance-tracked, CPU-only benchmark for optimizing the pyflatten flattening
stage, on the public Narratives dataset. Foundation pieces:

- paths/ledger/metrics/harness: method-agnostic evaluate(flatten_fn, ...) with
  uniform per-patch metrics (mean/p90 % distance error, flipped triangles, area
  distortion) and an append-only JSONL provenance ledger (commit SHA, env, seeds,
  metrics, artifacts, repro command, decision trace).
- build_dataset/run_baseline/report: build the manifest from cached Narratives
  patches, run the FreeSurfer-clone baseline (+ determinism check), render
  NOTEBOOK.md.
- Project recovery skill (.claude/skills/autoflatten-autoresearch) so any session
  can recover state from the ledger and resume the loop.
- PLAN.md (approved design) + README; bench extra in pyproject; unit tests.

Code lives in the repo; all generated artifacts go to /data2/projects/autoflatten.
See benchmark/PLAN.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Injects a guaranteed-injective Tutte embedding (libigl harmonic map, boundary
pinned to a circle) as the initial map and disables the initial negative-area-
removal phase, feeding the existing geodesic-stress refinement. Verified flip-free:
Tutte init gives 0/385122 flipped triangles at 33.7% distortion, vs the FreeSurfer
projection's 141690 flips at 172%. Logs a probe record with hypothesis + head-to-head
conclusion vs the latest baseline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
benchmark/plot.py renders the saved flat patches of a ledger experiment (mesh +
flipped triangles in red + boundary), pulling metrics from the ledger instead of
recomputing distortion (plot_flatmap's per-vertex distortion recompute is the
bottleneck: ~5s vs ~20min on a 193k-vertex hemisphere). --full keeps the 3-panel
plot. Visual check on sub-022 lh: baseline and Tutte-init maps are both clean
single-blob flatmaps with no tangling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Parametrized runner (init method + phase toggles + k-ring) that logs each variant
to the ledger with a decision trace, for fanning out optimization experiments.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive, reproducible, and provenance-tracked benchmark and autoresearch harness for the pyflatten stage of AutoFlatten. It includes dataset building from the public Narratives dataset, a method-agnostic evaluation harness, an append-only JSONL ledger for tracking experiments, quality metrics, plotting utilities, and a Tutte/LSCM flip-free initialization probe. The reviewer's feedback is highly constructive, pointing out several robustness improvements when parsing the JSON ledger (such as using .get() and handling explicit null values to avoid KeyError and TypeError exceptions) and highlighting a potential side-effect where mutating flattener.config in-place could unintentionally affect shared configuration objects across evaluations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread benchmark/experiment.py
Comment on lines +148 to +149
base = Ledger().latest(kind="baseline")
base_m = base["metrics"] if base else None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent a potential KeyError if the baseline record is missing the "metrics" key, use .get("metrics") instead of direct bracket access.

Suggested change
base = Ledger().latest(kind="baseline")
base_m = base["metrics"] if base else None
base = Ledger().latest(kind="baseline")
base_m = base.get("metrics") if base else None

Comment on lines +123 to +124
flattener.config.negative_area_removal.enabled = False
flattener.initial_projection = lambda: init # injected into run()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mutating flattener.config directly modifies the shared config object passed into evaluate(). If the same config instance is reused across multiple evaluations or baseline runs, this side-effect will unexpectedly disable negative area removal for those runs as well. Consider copying the config or avoiding in-place mutation of shared arguments.

Suggested change
flattener.config.negative_area_removal.enabled = False
flattener.initial_projection = lambda: init # injected into run()
import copy
flattener.config = copy.deepcopy(flattener.config)
flattener.config.negative_area_removal.enabled = False
flattener.initial_projection = lambda: init # injected into run()

Comment on lines +133 to +136
def _baseline_reference():
"""Most recent baseline record's aggregate metrics, for head-to-head printing."""
rec = Ledger().latest(kind="baseline")
return rec["metrics"] if rec else None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent a potential KeyError if the baseline record is missing the "metrics" key, use .get("metrics") instead of direct bracket access.

Suggested change
def _baseline_reference():
"""Most recent baseline record's aggregate metrics, for head-to-head printing."""
rec = Ledger().latest(kind="baseline")
return rec["metrics"] if rec else None
def _baseline_reference():
"""Most recent baseline record's aggregate metrics, for head-to-head printing."""
rec = Ledger().latest(kind="baseline")
return rec.get("metrics") if rec else None

Comment thread benchmark/plot.py
Comment on lines +106 to +107
records = Ledger().read()
record = next((r for r in records if r["experiment_id"] == experiment_id), None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use .get("experiment_id") instead of r["experiment_id"] to avoid raising a KeyError if any record in the ledger is missing the "experiment_id" key.

Suggested change
records = Ledger().read()
record = next((r for r in records if r["experiment_id"] == experiment_id), None)
records = Ledger().read()
record = next((r for r in records if r.get("experiment_id") == experiment_id), None)

Comment thread benchmark/plot.py
Comment on lines +110 to +112
artifacts = [
a for a in record.get("artifacts", []) if a.get("kind") == "flat_patch"
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If "artifacts" is explicitly set to null in the JSON ledger, record.get("artifacts", []) will return None, causing a TypeError during iteration. Using record.get("artifacts") or [] ensures it always defaults to an empty list.

Suggested change
artifacts = [
a for a in record.get("artifacts", []) if a.get("kind") == "flat_patch"
]
artifacts = [
a for a in (record.get("artifacts") or []) if a.get("kind") == "flat_patch"
]

Comment thread benchmark/report.py
Comment on lines +54 to +55
commit = (r.get("git", {}).get("commit") or "")[:8]
dirty = "*" if r.get("git", {}).get("dirty") else ""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If "git" is explicitly set to null in the JSON ledger, r.get("git", {}) will return None, causing an AttributeError when calling .get("commit"). Using r.get("git") or {} ensures it always defaults to a dictionary.

Suggested change
commit = (r.get("git", {}).get("commit") or "")[:8]
dirty = "*" if r.get("git", {}).get("dirty") else ""
git = r.get("git") or {}
commit = (git.get("commit") or "")[:8]
dirty = "*" if git.get("dirty") else ""

Comment thread benchmark/report.py
for r in records:
commit = (r.get("git", {}).get("commit") or "")[:8]
dirty = "*" if r.get("git", {}).get("dirty") else ""
m = r.get("metrics", {})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If "metrics" is explicitly set to null in the JSON ledger, r.get("metrics", {}) will return None, causing an AttributeError when calling .get(). Using r.get("metrics") or {} ensures it always defaults to a dictionary.

Suggested change
m = r.get("metrics", {})
m = r.get("metrics") or {}

Comment thread benchmark/report.py
Comment on lines +75 to +79
git = r.get("git", {})
lines.append(
f"- **commit**: `{git.get('commit')}` ({git.get('branch')})"
+ (" ⚠️ dirty tree" if git.get("dirty") else "")
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If "git" is explicitly set to null in the JSON ledger, r.get("git", {}) will return None, causing an AttributeError when calling .get("commit"). Using r.get("git") or {} ensures it always defaults to a dictionary.

Suggested change
git = r.get("git", {})
lines.append(
f"- **commit**: `{git.get('commit')}` ({git.get('branch')})"
+ (" ⚠️ dirty tree" if git.get("dirty") else "")
)
git = r.get("git") or {}
lines.append(
f"- **commit**: `{git.get('commit')}` ({git.get('branch')})"
+ (" ⚠️ dirty tree" if git.get("dirty") else "")
)

Comment thread benchmark/report.py
Comment on lines +80 to +84
env = r.get("environment", {})
lines.append(
f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], "
f"python {env.get('python')}, host {env.get('hostname')}"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If "environment" is explicitly set to null in the JSON ledger, r.get("environment", {}) will return None, causing an AttributeError when calling .get(). Using r.get("environment") or {} ensures it always defaults to a dictionary.

Suggested change
env = r.get("environment", {})
lines.append(
f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], "
f"python {env.get('python')}, host {env.get('hostname')}"
)
env = r.get("environment") or {}
lines.append(
f"- **env**: jax {env.get('jax_version')} [{env.get('jax_backend')}], "
f"python {env.get('python')}, host {env.get('hostname')}"
)

Comment thread benchmark/report.py
Comment on lines +89 to +90
decision = r.get("decision", {})
if decision.get("hypothesis"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If "decision" is explicitly set to null in the JSON ledger, r.get("decision", {}) will return None, causing an AttributeError when calling .get(). Using r.get("decision") or {} ensures it always defaults to a dictionary.

Suggested change
decision = r.get("decision", {})
if decision.get("hypothesis"):
decision = r.get("decision") or {}
if decision.get("hypothesis"):

Tutte flip-free init verified across 4 hemispheres / 2 subjects: equal quality
(14.75% vs 14.73% distortion, 121 vs 121 flips) at 37% less runtime. Ablations show
the final NAR + spring smoothing are essential (skipping them lowers the distortion
number to 13.2% but leaves ~21k flipped triangles — a folded map); only the initial
NAR is removable. LSCM init ~ Tutte.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mvdoc

mvdoc commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Autoresearch results (CPU, public Narratives benchmark)

Tutte flip-free init — verified across 4 hemispheres / 2 subjects (sub-022, sub-026):

mean distortion total flips mean runtime
baseline (FS-clone) 14.73% 121 702 s
Tutte init 14.75% 121 442 s (−37%)

Equal quality (distortion +0.03pp, flips identical), ~37% faster. The Tutte map starts flip-free
(0 vs ~141k for the projection), so the initial negative-area-removal phase is unnecessary. Maps are
visually clean on both subjects.

Ablations — the final NAR and spring smoothing are NOT removable (a "numbers lie" result):
skipping the final NAR posts the lowest distortion in the study (13.21%) but leaves ~21000
flipped triangles
— a folded, invalid map. Only the initial NAR is removable. LSCM init ≈ Tutte.

Full provenance for all 9 experiments is in the ledger (NOTEBOOK.md / experiments.jsonl under
the data root). Curated summary added in benchmark/FINDINGS.md.

mvdoc and others added 2 commits June 10, 2026 05:21
Tests replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers on
the metric+area energy, from the Tutte init. Adds optax/jaxopt to the bench extra.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…the lever

L-BFGS/CG on the same energy fold (10.31% distortion but 58025 flips, 8x faster); the
area-weight sweep is Pareto-worse than baseline; the epoch schedule still folds; a flip
barrier keeps injectivity but stalls L-BFGS at the init. The FreeSurfer optimizer's value
is the multi-level gradient smoothing + NAR, not the line search. Productive next step is
multiscale/hierarchical (multigrid) optimization or SLIM, not a drop-in optimizer swap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mvdoc

mvdoc commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Optimizer-algorithm experiments (answering "can we swap the optimizer?")

Tested replacing the FreeSurfer-style line-search GD with off-the-shelf optimizers (scipy
L-BFGS-B / CG) on the same metric+area energy, from the flip-free Tutte init.

  • L-BFGS folds: reaches 10.31% distortion in 83 s (~8× faster) but 58025 flipped triangles.
  • Pareto-worse than baseline: sweeping the area weight can't get low distortion and low flips
    (to match ~24 flips it needs ~25–29% distortion vs baseline 15%).
  • Epoch weight schedule doesn't help: still folds (71269 flips at the distance-dominant stage).
  • One-sided flip barrier: keeps injectivity (~0 flips) but L-BFGS then stalls at the init
    (33.7%) — local distance-reducing moves all push toward folding.

Conclusion: the FreeSurfer optimizer's value is the multi-level gradient smoothing (coherent
coarse-to-fine moves that avoid folding) + NAR, not the line-search algorithm. A drop-in optimizer
swap is a dead end on this energy; the productive directions are multiscale/hierarchical
(geometric-multigrid) optimization
or a proper SLIM local-global solver. Details in
benchmark/FINDINGS.md; all runs in the ledger (10 experiments).

mvdoc and others added 18 commits June 10, 2026 05:45
Parametrizes the flatmap in the low-frequency cotangent-Laplacian eigenbasis and
optimizes coarse-to-fine: band-limited maps can't fold, so the coarse solve is flip-free
by construction (19 flips, 66s, 22% distortion vs L-BFGS's 58k flips). Used as a coarse
init for full-resolution refinement (geometric multigrid V-cycle). Harmonics cached per
mesh by content hash.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Spectral (manifold-harmonic) multigrid: coarse solve is flip-free by band-limiting (19
flips, 66s, 22%) but band-limited; full V-cycle matches quality (15.3%) but isn't faster
(eigsh+spectral overhead). Adam (fixed or per-level-scheduled lr) folds or stalls -- a
fixed step can't span the multiscale step range. Conclusion: nothing beats the FreeSurfer
multiscale line-search on this energy; the line search is the load-bearing piece. Real
wins need mesh-decimation multigrid or SLIM, not a drop-in optimizer swap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…othing cap)

Profiling shows per-iteration cost is gradient (177ms) + line search (272ms) + smoothing
(up to 902ms at n_avg=1024). Expose --n-coarse-steps, --iters-per-level, --max-smoothing
to test pure runtime reductions that don't change the energy.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Profiling: per-iteration cost is gradient (177ms) + line search (272ms) + smoothing
(up to 902ms at n_avg=1024). Independent, compounding speed levers measured on sub-022 lh:
line search 15->7 points (-30%, no quality cost), k-ring neighbors 12->6 (-33%, +0.1pp),
fewer iters (-20%), capped smoothing (-7%). Combined fast config: 262s vs 658s original
baseline (~60% faster) at +0.31pp distortion. The practical path to speed is config
levers + Tutte init, not a new optimizer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stacked fast config across 4 hemispheres: 194s vs 702s baseline (-72%) at 15.10% vs
14.73% distortion, visually clean maps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…calibration

Built a heat-method true-geodesic distortion yardstick; it shows the k-ring metric
overstates distortion (~16% true vs 33.7% k-ring for Tutte) and mis-ranks maps. k-ring
targets are ~9% inflated locally, BUT recalibrating the 1.207 correction down makes true
distortion monotonically worse (18.15% -> 21.66% -> 26.88% for 1.207/1.10/1.00): the
factor usefully pre-stretches to compensate curvature-induced medium-range compression.
Default is near-optimal; recalibration is a dead end. Real remaining lever: per-edge
geodesic targets in the energy (heavier).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…put scale

Adds benchmark/truedist.py (energy-independent true-geodesic distortion, banded +
global) and benchmark/probe_truedist.py (k_ring / epoch-skip / target-scale sweeps
scored on the true metric). FINDINGS section 9:
- targets are a confirmed dead end (Dijkstra over-estimates geodesic only ~7.8%,
  flat across distance; current target=Dijkstra/1.207 is ~11% too compact; 1.207
  near-optimal).
- k_ring 7->11 does not help (more flips, slower).
- the local <=30mm metric is gameable: a conformal Tutte disk wins it but is a
  degenerate flatmap; the global all-pairs metric correctly ranks the full pipeline best.
- validated near-free win: the final scale_to_area normalization is not
  distance-optimal; rescaling output by ~1.04 cuts global geodesic distortion ~0.5pp
  (~5% rel), reproduced on held-out geodesic sources (single-hemi; multi-hemi TBD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… term)

Read the original Fischl/Sereno/Dale 1999 energy: J = lambda_d J_d + lambda_a J_a,
where J_a is gated to negative-area (folded) triangles only -> fold removal, NOT area
preservation. Objective is purely metric (distance) distortion; area is a dependent
byproduct. scale_to_area is a display convention, not part of the objective. Corrects
section 9(d): the ~1.04 scale gain is real on true geodesics but is a symptom of the
target-compaction bias, and the k-ring surrogate prefers the opposite scale; not a
clean free win.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… on global

Adds true_distortion_full (local<=R, global all-pairs, distance-optimal scale) and a
build_truegeo runner. Target-scale sweep on sub-022 lh scored on the GLOBAL metric:
target_scale=1.10 (effective correction ~1.10, near the measured Dijkstra/geodesic
ratio) cuts global true distortion 11.80->11.15% (-0.65pp, ~5.5% rel), flips unchanged.
Reverses section 8's local-metric "dead end": on the faithful global metric, calibrating
the correction toward true geodesics helps. Cross-hemisphere validation pending.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eep)

Section 10: re-running the correction sweep scored on the GLOBAL true metric (not the
gameable local one of section 8) reverses section 8's "dead end" -- larger targets
(eff. correction ~1.10, toward the measured Dijkstra/geodesic ratio ~1.08) reduce global
true distortion -- but the win is subject-dependent: clear on sub-022 (both hemis, -0.1
to -0.5pp), neutral/slightly-worse on sub-026 lh. Flips stay comparable. Robust component
is the output-scale fix (s*~1.02 everywhere). A larger robust gain needs long-range
geodesic anchors in the energy (next step, pending scope decision).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…a robust default

Section 11: extended the correction sweep to 8 hemispheres across 7 subjects. Two robust
negatives: (1) the local Dijkstra/geodesic ratio is nearly subject-invariant (~1.08) and
does NOT predict the optimal correction, so per-subject auto-calibration won't work; (2) no
correction change is universally safe -- 5/8 hemis want ~1.10, 3 prefer the original 1.207,
and a milder 1.05 doesn't rescue the non-helpers (sub-075 is worse). Mean gain only -0.19pp
with per-subject downside up to +0.35pp. Config-lever avenue exhausted; the only consistently
safe op is the distance-optimal output scale (small); a larger robust gain needs long-range
geodesic anchors in the energy. Adds 5 validation subjects (lh) to the manifest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-up)

Adds a paper-facing executive summary and updates the next-ideas list to reflect the
exhausted config-lever search: long-range geodesic anchors remain the one open lever for
a larger robust distance-error reduction; distance-optimal output scale is the small safe
ship; speed defaults (Tutte init, lean levers) are validated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-better quality

Add benchmark/validate_speed.py and FINDINGS section 12. The speed levers were
tuned on a single hemisphere (sub-022 lh); this re-validates the stacked
fast_ultimate config across all 9 manifest hemispheres, grouped tuned/seen/
held-out, scoring quality with the energy-independent global true-geodesic
metric (not the incomparable raw k-ring number).

Result: 3.36x mean speedup [3.12-3.60], held-out (3.34x) matches tuned (3.42x)
=> not overfit. Corrects section 7's "+0.37pp worse" (a k-ring metric artifact):
on the faithful metric the fast config is -0.73pp BETTER on every scored
hemisphere. 10 ledger records logged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-vertex angular-sampling loop in compute_kring_geodesic_distances_angular
was serial Python over all vertices (~4 min/hemi, the one-time cache-build cost),
while 31 of 32 cores sat idle. Fold tangent-plane projection + per-ring angular
sampling + limited Dijkstra into a single @njit(parallel=True) prange kernel
(_angular_kring_kernel) plus an njit port of the angular sampler
(_select_angular_samples_njit).

Output is bit-identical to the previous numba production path: verified against
the existing sub-022 lh cache (0/193174 neighbor-set mismatches, 0 distance
mismatches), and pinned by a new regression test
(TestAngularKernelParity). Steady-state runtime for sub-022 lh drops from
~230s to 8.7s (~26x); ~17s including cold JIT. The serial loop is kept as the
use_numba=False fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… cold)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reimplement the projection stage's only FreeSurfer dependency
(mri_label2label --regmethod surface) as a pure-Python union(push, pull)
KDTree mapping on sphere.reg. Validated three ways:

- vs real mri_label2label (FS 6.0): 0 vertex-ID differences across 6 cuts
  x 3 hemispheres (validate_mapping_vs_freesurfer.py)
- end-to-end patch bit-identical on 9/9 manifest hemispheres incl. all 5
  held-out subjects (validate_projection.py)
- unit invariants, no FreeSurfer (tests/test_projection.py)

Mapping step: 36.6s (FreeSurfer subprocess) -> ~0.3s (KDTree). Projection
no longer requires FreeSurfer. Documented in FINDINGS section 14; ledger
records logged. Phase 2 (cut-placement quality, scored by downstream
flatmap distortion) is next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add benchmark/probe_refinement.py to test whether projection refinement
(continuity + geodesic) improves downstream flatmap distortion, scored by
the fast flattener + global true-geodesic metric (fresh k-ring + truegeo
per variant). Add a `continuity` toggle to project_python for the ablation.

First signal (sub-022 lh): geodesic refinement gives HIGHER global
distortion (11.37%) than raw mapped cuts (10.69%); all variants are
topologically valid. N=1, <1pp spread — broadening before concluding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ontinuity

Ablation across 6 hemispheres (probe_refinement.py), scored by downstream
fast-flatten + global true-geodesic distortion:

  geodesic (shipped)   11.12%   worst/tied, inflates flips
  continuity_only      10.84%   BEST (-0.28pp), fewer flips
  mapped_only          11.12%   continuity is what helps, not thinning
  geodesic_curv (c)    11.24%   worst; flips exploded (424 on sub-022 rh)

The geodesic refinement does not help and slightly hurts: connecting cut
components (continuity) is the win; thinning the thick mapped cut to a
1-vertex geodesic is the harm (removes strain relief, jagged boundary).
Curvature-weighted routing (c) failed because sulci meander -> more
boundary tortuosity -> more flips, confirming thinning (not the routing
metric) is the core problem.

Adds a `continuity` toggle and `refine_weight=curvature` (curv_alpha) path
to project_python. Shippable recommendation: default projection to
geodesic-refinement-off. Documented in FINDINGS section 15.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mvdoc mvdoc marked this pull request as draft June 11, 2026 17:53
mvdoc and others added 7 commits June 11, 2026 12:25
Tested the boundary-smoothing follow-on: keep the thick continuity_only
cut but morphologically close its boundary (project_python morph_close=n;
_morphological_close_cuts). Result across 6 hemispheres:

  continuity_only   10.84%   BEST (robust winner)
  thick_close2      10.99%   reduces worst flips (59->33) but distortion mixed
  thick_close1      11.01%   same; hurts sub-026 lh by +1pp

Both "smarter" refinements (curvature routing (c), boundary smoothing (2))
fail to beat plain continuity_only. The simplest answer wins: connect the
cut components, don't thin, don't smooth. Coverage cost of closing is
negligible (<150 verts, <0.1% of surface; medial wall ~11k dominates).
Adds --save-flat to the probe for visualization. FINDINGS section 15 updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… O(n^2))

The non-angular k-ring builder allocated three O(n_vertices) arrays inside a
per-vertex prange and collected results with an O(n_vertices) scan per
vertex. numba's parallel analysis refuses to parallelize a loop with
per-iteration allocations, so a fresh compile ran serially -> O(n^2),
~16 min on a 193k-vertex mesh (a cached build had masked this).

Rewrite to parallelize over chunks: split vertices into n_chunks blocks,
prange over chunks, each owns its scratch (allocated once) and resets only
touched entries; collect each ring from the touched list and sort to match
the reference output. Identical output (matches pure-Python get_k_ring),
now ~12s and multi-core on 193k verts. Used by plotting and the uncached
non-angular flatten path. 313 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
compute_kring_geodesic_distances ran the per-vertex limited Dijkstra in a
serial Python list comprehension (~16s on 193k verts). Add a chunk-parallel
kernel mirroring _limited_dijkstra_numba with per-chunk scratch (allocated
once, only touched entries reset), producing distances in the flat k-ring
layout. Bit-exact vs the serial path (max diff 0.0 over 3000 sampled verts),
multi-core. Exposes get_k_ring_fast_flat to share the flat ring layout.

End-to-end flatmap plot: 43s -> 27s. Helps the uncached non-angular flatten
path too. 313 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the 193k-iteration Python loop with a flattened numpy computation
(concatenate neighbor lists, segment-sum via bincount). Per-vertex distortion
== 100 * sum_abs / sum_target since the neighbor count cancels. Matches the
old loop (max per-vertex diff 8.5e-14). The loop is now ~0.2s.

With the parallel k-ring (ring builder + distances), end-to-end flatmap plot
is ~22.6s (was 43s). 313 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The flatmap distortion overlay now (by default in plot_flatmap):
- rescales by the distance-optimal global scale s* before scoring, so a
  global scale offset no longer inflates the per-vertex map (shows true
  local shape distortion; s* is reported in the panel title);
- shows SIGNED relative distortion with a diverging colormap (RdBu_r):
  + = flatmap stretched, - = compressed, instead of magnitude-only.

compute_kring_distortion gains optimal_scale/signed/return_opt_scale flags;
defaults preserve the original magnitude/raw-scale behavior (viz tests and
the scaling-sensitivity test unchanged). 313 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l (default on)

The flattener's final scale was area-matched (s = sqrt(orig_area/total_area)),
a display convention that leaves the map ~6% too small vs true geodesic
distances (validated: opt scale 1.063 +/- 0.008 across 6 hemispheres,
expand). Add distance_optimal_scale: after optimization, rescale the output
by the single global scale that minimizes true-geodesic distortion, computed
from a deterministic heat-method geodesic sample on the fiducial patch. So
the surface and the flat map are directly comparable (the use case: viewing
inflated + flattened together).

New DistanceOptimalScaleConfig (enabled by default, n_sources=200, seed=0);
distance_optimal_scale() in distance.py; applied in SurfaceFlattener.run()
before final stats. Deterministic (fixed seed; heat method is deterministic).
End-to-end: s=1.055 on sub-022 lh, true distortion 13.87% -> 13.23% (-0.64pp).

viz.plot_flatmap now scores at the true (already-faithful) output scale
instead of re-normalizing by the gameable local k-ring optimum. 313 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ated)

Document the shipped distance-optimal output scale: true-geodesic optimum
1.063 +/- 0.008 (expand) across 6 hemispheres, lowers global distortion on
every one; replaces area-matching as the default. Note the sign correction
(local k-ring optimum points the wrong way, ~0.95 shrink, due to target
compaction). Update executive summary + Next-ideas (marked shipped).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant