Retrain + re-eval baselines on v0.3 (IGSO): protocol bump + baked-in calibration by djankov · Pull Request #112 · astro-tools/maneuver-detect

djankov · 2026-06-15T18:33:03Z

Summary

Lands the committable code half of the v0.3 release-cut work, so that the credentialed retrains and republish at the cut only have to produce the numbers, with no code left to write. Four pieces:

Score-JSON protocol bump (D17). ScoreReport.to_json now persists the per-class operating_point_confidence (the confidence cut at the headline false-alarm rate) as an additive field; it was already computed on ClassMetrics in-memory. The committed scorer golden is regenerated (purely additive), a decision record D17 is added, and a regression test locks that the public leaderboard ignores the new field.
v0.3 repoints. The learned-baseline training drivers and the leaderboard fixture builder now track the bundled catalogue version (currently dataset/v0.3, stamping 0.3.0) instead of hard-coding v0.2, and default the selection objective to macro.
Baked-in uncertainty calibration. A new BundledCalibration (val-fit temperature + conformal predictor + per-orbit-class reliability/ECE) is stored in both bundle types and applied at inference, so every published detector emits calibrated confidence with no calibration data at load. Old bundles without it load unchanged.
Publish it. The offline drivers fit the calibration on the val split (val-only, no test leakage) and bake it before scoring test; the model cards gain a calibrated per-class operating-point column and a calibration section (IGSO row renders automatically); the benchmark docs document it with a committed-data-free format_reliability_curve render helper.

Out of scope — manual release-cut steps (you run these)

The credentialed runs can't run here (no GPU; checkpoints are gitignored, so they produce no git diff). At the release cut, on a GPU box with Space-Track + HF_TOKEN:

Retrain bilstm-base / transformer-base on v0.3 (examples/train_*_real.py, now macro + calibrated).
Re-run calibrate-foundation for chronos-residual on v0.3.
Republish the HF checkpoints/bundles + regenerated cards (IGSO row, calibrated confidence), lockstep with the v0.3 tag.
Re-seed the public leaderboard (leaderboard/build_fixture.py).

Per-class foundation-gate refinement, dataset/label changes, and HEO remain out of scope.

Test plan

uv run pytest — full suite green (695 passed)
uv run ruff check / uv run ruff format --check
uv run mypy
Credentialed retrains + foundation re-calibrate + HF republish + leaderboard re-seed (manual, at the release cut)

Calibration wiring is exercised with synthetic / stand-in inputs only; the real per-class numbers (including IGSO) come from the credentialed runs above.

Closes #109

…ocol bump)
Add an additive ``operating_point_confidence`` to ``ScoreReport.to_json``'s per-class payload: the confidence cut admitted within the false-alarm budget at the headline operating point, the per-class operating point an uncertainty-calibration pass publishes. It was already computed on ``ClassMetrics`` in-memory (the v0.2 report kept it off the serialised artifact); persisting it changes the frozen golden, so it is a v0.3-boundary change recorded as decision D17.
- benchmark/scoring.py: serialise the field; benchmark/metrics.py: refresh the now-stale "in-memory only" docstring.
- Regenerate the committed scorer golden (tests/data/benchmark/scores.json): purely additive, every prior field byte-for-byte unchanged.
- Lock the leaderboard tolerance: the Space re-scores live and its public response is a strict aggregate subset, so the new field is inert there (new regression test).
Part of the v0.3 release-cut work for #109.
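A minimal sketch of what "purely additive" means for a downstream consumer, assuming a dict-shaped per-class payload. Only the field name operating_point_confidence comes from the commit message; the payload helpers, the other keys and LEADERBOARD_KEYS are hypothetical stand-ins, not the repository's ScoreReport or leaderboard code.

```python
import json

# Hypothetical v0.2-style per-class payload (key names are illustrative).
def class_payload_v02(recall: float, precision: float) -> dict:
    return {"recall": recall, "precision": precision}

# v0.3-style payload: same keys plus one additive field.
def class_payload_v03(recall: float, precision: float,
                      operating_point_confidence: float) -> dict:
    payload = class_payload_v02(recall, precision)
    payload["operating_point_confidence"] = operating_point_confidence
    return payload

# Hypothetical consumer that reads only the keys it knows about, so an
# unknown extra field cannot change what it publishes.
LEADERBOARD_KEYS = ("recall", "precision")

def leaderboard_view(payload: dict) -> dict:
    return {key: payload[key] for key in LEADERBOARD_KEYS}

old = class_payload_v02(0.8, 0.9)
new = json.loads(json.dumps(class_payload_v03(0.8, 0.9, 0.73)))

# Every prior field is unchanged, and the consumer's view is identical.
assert all(new[key] == old[key] for key in old)
assert leaderboard_view(new) == leaderboard_view(old)
```

A consumer that selects known keys, rather than rejecting unknown ones, is what makes a field addition like this safe across the version boundary.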

The learned-baseline training drivers and the leaderboard fixture builder still pointed at dataset/v0.2 and stamped dataset_version 0.2.0, predating the IGSO growth. Track the bundled catalogue version instead of hard-coding the path/stamp, so the committed labels/splits and the reconstruction recipe stay in lockstep (a future catalogue bump repoints them without a driver edit).
- train_bilstm_real.py, train_transformer_real.py: derive _DATA from DATASET_VERSION (currently v0.3), stamp the bundle's dataset_version from it, and default the selection objective to `macro` (the class-balanced retrain the v0.3 baselines run under).
- build_fixture.py: derive _DATA from DATASET_VERSION the same way.
- De-peg the version numbers from docstrings/prints so they don't rot at the next bump.
Part of the v0.3 release-cut work for #109.

Wire the uncertainty-calibration machinery into the published artifacts so every published detector emits calibrated confidence with no calibration data at load.
- calibration.py: add BundledCalibration, the serialisable, val-fit calibration baked into a bundle (a pooled temperature, the conformal predictor, and per-orbit-class reliability + ECE). `fit` pools the per-class val samples for the calibrator and measures reliability/ECE on the calibrated confidences; an empty (sparse) class rides through with a zero ECE. Add an `apply_calibration` helper and route CalibratedDetector through it.
- checkpoint.py / foundation.py: add an optional `calibration` slot to ModelBundle and FoundationBundle, round-tripped through save/load; a bundle saved before the slot loads as None (back-compatible).
- learned.py / detectors/foundation.py: adopt the bundle's calibrator on load and remap the emitted confidence in detect() when one is present.
- Tests: BundledCalibration fit/round-trip/sparse-class, bundle save/load round-trip + back-compat, and end-to-end calibrated inference for both detector families.
Part of the v0.3 release-cut work for #109.
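The optional-slot pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's ModelBundle or BundledCalibration: the Bundle and Calibration classes, the JSON layout and remap_confidence are all hypothetical, and the calibration is reduced to a single temperature applied in logit space.

```python
import json
import math
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Calibration:          # hypothetical: temperature only
    temperature: float

@dataclass
class Bundle:               # hypothetical stand-in for a model bundle
    weights: list
    calibration: Optional[Calibration] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "Bundle":
        raw = json.loads(text)
        # Bundles saved before the slot existed have no key: load as None.
        cal = raw.get("calibration")
        return cls(raw["weights"], Calibration(**cal) if cal is not None else None)

def remap_confidence(confidence: float, calibration: Optional[Calibration]) -> float:
    """Temperature-scale a confidence in logit space; pass through if uncalibrated."""
    if calibration is None or confidence <= 0.0 or confidence >= 1.0:
        return confidence
    z = (math.log(confidence) - math.log1p(-confidence)) / calibration.temperature
    return 1.0 / (1.0 + math.exp(-z))

old_bundle = Bundle.from_json('{"weights": [0.1, 0.2]}')      # pre-slot file
new_bundle = Bundle.from_json(Bundle([0.1], Calibration(2.0)).to_json())
```

Here old_bundle emits its raw confidence unchanged, while new_bundle maps 0.9 to roughly 0.75 under the assumed temperature of 2.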

Apply the baked-in calibration end to end: fit it in the offline drivers, render it on the model cards, and document it in the benchmark protocol.
- evaluate.py: add fit_calibration_on_val, which takes the per-class val samples in and returns a BundledCalibration (the calibrator the bundle ships).
- Drivers: train_bilstm_real.py / train_transformer_real.py fit the calibration on val and bake it before scoring test (so the test report's operating point is in calibrated units); the foundation calibrate_and_score does the same between threshold calibration and scoring, threading the calibrator through the scored detector. A val split with no matched detection ships uncalibrated rather than failing.
- Model cards: a per-class "Operating pt" column (the calibrated confidence cut) in the test table, plus a calibration section (temperature, conformal coverage, per-class ECE), shared by the torch and foundation cards. The IGSO row renders automatically.
- calibration.py: add format_reliability_curve, a committed-data-free text reliability diagram rendered straight from a bundle; documented in the benchmark protocol's calibration section.
- Tests cover the driver baking, card rendering, and the render helper.
Part of the v0.3 release-cut work for #109.
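To give a feel for a text reliability diagram, here is a small sketch. It is not format_reliability_curve: the function name render_reliability, its input shape (a list of bin centre, empirical accuracy and sample count) and the output layout are assumptions made for illustration.

```python
def render_reliability(bins, width: int = 20) -> str:
    """Render one text row per confidence bin.

    bins: iterable of (bin_centre, empirical_accuracy, count) tuples.
    A well-calibrated bin has accuracy close to its bin centre.
    """
    lines = []
    for centre, accuracy, count in bins:
        if count == 0:
            # An empty (sparse) bin is shown rather than dropped.
            lines.append(f"{centre:4.2f} | {'':<{width}} | (empty)")
            continue
        bar = "#" * round(accuracy * width)
        lines.append(f"{centre:4.2f} | {bar:<{width}} | acc={accuracy:4.2f} n={count}")
    return "\n".join(lines)

diagram = render_reliability([(0.25, 0.2, 10), (0.75, 0.0, 0)], width=10)
print(diagram)
```

Because the diagram is built only from per-bin statistics, it can be rendered from whatever a bundle stores without access to the underlying val data.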

The per-class operating point is now persisted into the scoring JSON; assert it is present (was asserted absent under the v0.2 in-memory-only contract). Part of the v0.3 release-cut work for #109.

… code
Pure formatting: reflow docstrings/strings and let ruff format wrap the new call sites; no behaviour change. Part of the v0.3 release-cut work for #109.

On a sparse / poorly-separated val split the BCE-optimal temperature can collapse toward the clamp bound and merely flatten confidence toward the base rate. The remap is monotonic, so recall/precision are unchanged, but it distorts the confidence column rather than calibrating it.
Guard it: BundledCalibration.fit keeps the fitted temperature only when it reduces the pooled val ECE; otherwise it ships identity (T=1, raw confidence), so a detector that cannot be meaningfully calibrated emits its raw confidence.
Also clamp the exponent in the internal sigmoid so a near-separable fit no longer raises a benign exp-overflow RuntimeWarning (the sigmoid has already saturated; the value is unchanged).
Tests: identity fallback on a perfectly-calibrated separable sample, and the do-no-harm invariant (shipped ECE never exceeds raw).
Part of the v0.3 release-cut work for #109.

…ollapse)
The benchmark ranks detections by confidence, so a monotonic calibration must leave recall and precision unchanged. It didn't: the logit clamp pinned every confidence above 1 - 1e-6 to the same value, collapsing distinct *saturating* confidences into a tie.
The forecast-residual detector's confidence is 1 - exp(-z/threshold), which exceeds that bound for any strong detection, so its strong true positives and false alarms tied. The scorer's deterministic false-alarm-first tie-break then admitted the tied false alarms first, exhausting a small class's false-alarm budget before any true positive and zeroing its recall (observed: foundation LEO 0.49 -> 0.00).
Clamp the logit only at exactly 0/1 (where it is infinite), via nextafter, so every distinct interior confidence keeps a distinct logit and calibration never introduces a benchmark tie that raw confidence did not already have.
Regression tests: recall is invariant to a monotonic calibration even with saturating, tied-prone confidences (reproduces the collapse under the old clamp), plus order-preservation at the unit level.
Part of the v0.3 release-cut work for #109.
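The tie collapse is easy to reproduce in isolation. For the confidence 1 - exp(-z/threshold), the value exceeds 1 - 1e-6 as soon as z/threshold > 6 ln 10 ≈ 13.8, so any stronger detection lands on the old clamp. The sketch below contrasts the two clamps; the function names are hypothetical, the threshold of 1.0 is an arbitrary choice, and math.nextafter requires Python 3.9 or later.

```python
import math

def logit_eps_clamp(p: float, eps: float = 1e-6) -> float:
    # Old behaviour: everything above 1 - eps is pinned to the same value.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) - math.log1p(-p)

def logit_edge_clamp(p: float) -> float:
    # New behaviour: move only exact 0 and 1 to the adjacent float,
    # so every distinct interior confidence keeps a distinct logit.
    lowest = math.nextafter(0.0, 1.0)
    highest = math.nextafter(1.0, 0.0)
    p = min(max(p, lowest), highest)
    return math.log(p) - math.log1p(-p)

def residual_confidence(z: float, threshold: float) -> float:
    return 1.0 - math.exp(-z / threshold)

# Three strong detections of different strength, all above 1 - 1e-6.
strong = [residual_confidence(z, 1.0) for z in (15.0, 20.0, 25.0)]

old_logits = [logit_eps_clamp(c) for c in strong]
new_logits = [logit_edge_clamp(c) for c in strong]

print(len(set(old_logits)))  # 1: the three detections tie
print(len(set(new_logits)))  # 3: the ranking survives
```

Once the logits tie, any later monotonic remap preserves the tie, which is why the ordering has to be protected at the clamp rather than downstream.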

djankov added 8 commits June 15, 2026 13:51

Flip the operating-point serialisation test for the v0.3 protocol bump

9ce7887


Wrap long lines to satisfy ruff line-length over the v0.3 calibration…

8f70c79


djankov marked this pull request as ready for review June 15, 2026 22:27

djankov merged commit 4440a73 into main Jun 15, 2026
25 of 26 checks passed

djankov deleted the issue-109-reeval-baselines-v03-igso branch June 15, 2026 22:27

djankov merged 8 commits into main from issue-109-reeval-baselines-v03-igso