Retrain + re-eval baselines on v0.3 (IGSO): protocol bump + baked-in calibration #112
Merged
…ocol bump)

Add an additive ``operating_point_confidence`` to ``ScoreReport.to_json``'s per-class payload: the confidence cut admitted within the false-alarm budget at the headline operating point, i.e. the per-class operating point that an uncertainty-calibration pass publishes. It was already computed on ``ClassMetrics`` in memory (the v0.2 report kept it off the serialised artifact); persisting it changes the frozen golden, so it is a v0.3-boundary change recorded as decision D17.

- benchmark/scoring.py: serialise the field; benchmark/metrics.py: refresh the now-stale "in-memory only" docstring.
- Regenerate the committed scorer golden (tests/data/benchmark/scores.json): purely additive, every prior field byte-for-byte unchanged.
- Lock the leaderboard tolerance: the Space re-scores live and its public response is a strict aggregate subset, so the new field is inert there (new regression test).

Part of the v0.3 release-cut work for #109.
The learned-baseline training drivers and the leaderboard fixture builder still pointed at dataset/v0.2 and stamped dataset_version 0.2.0, predating the IGSO growth. Track the bundled catalogue version instead of hard-coding the path/stamp, so the committed labels/splits and the reconstruction recipe stay in lockstep (a future catalogue bump repoints them without a driver edit).

- train_bilstm_real.py, train_transformer_real.py: derive _DATA from DATASET_VERSION (currently v0.3), stamp the bundle's dataset_version from it, and default the selection objective to `macro` (the class-balanced retrain the v0.3 baselines run under).
- build_fixture.py: derive _DATA from DATASET_VERSION the same way.
- De-peg the version numbers from docstrings/prints so they don't rot at the next bump.

Part of the v0.3 release-cut work for #109.
Wire the uncertainty-calibration machinery into the published artifacts so every published detector emits calibrated confidence with no calibration data at load.

- calibration.py: add BundledCalibration, the serialisable, val-fit calibration baked into a bundle (a pooled temperature, the conformal predictor, and per-orbit-class reliability + ECE). `fit` pools the per-class val samples for the calibrator and measures reliability/ECE on the calibrated confidences; an empty (sparse) class rides through with a zero ECE. Add an `apply_calibration` helper and route CalibratedDetector through it.
- checkpoint.py / foundation.py: add an optional `calibration` slot to ModelBundle and FoundationBundle, round-tripped through save/load; a bundle saved before the slot loads as None (back-compatible).
- learned.py / detectors/foundation.py: adopt the bundle's calibrator on load and remap the emitted confidence in detect() when one is present.
- Tests: BundledCalibration fit/round-trip/sparse-class, bundle save/load round-trip + back-compat, and end-to-end calibrated inference for both detector families.

Part of the v0.3 release-cut work for #109.
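The back-compatible optional slot can be illustrated with a toy bundle. This is a sketch only: `ToyBundle`, `ToyCalibration`, and the JSON encoding are hypothetical stand-ins, not the real ModelBundle / FoundationBundle serialisation, which this page does not show.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyCalibration:
    """Hypothetical stand-in for the baked-in calibration."""
    temperature: float


@dataclass
class ToyBundle:
    """Hypothetical bundle with an optional calibration slot."""
    weights: list
    calibration: Optional[ToyCalibration] = None

    def dumps(self) -> str:
        # asdict recurses into the nested dataclass; None stays None.
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, text: str) -> "ToyBundle":
        raw = json.loads(text)
        # .get() makes a payload saved before the slot existed load as None.
        cal = raw.get("calibration")
        return cls(
            weights=raw["weights"],
            calibration=ToyCalibration(**cal) if cal is not None else None,
        )
```

A legacy payload such as `{"weights": [1.0]}` loads with `calibration` set to `None`, while a new bundle round-trips its calibrator unchanged.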
Apply the baked-in calibration end to end: fit it in the offline drivers, render it on the model cards, and document it in the benchmark protocol.

- evaluate.py: add fit_calibration_on_val: per-class val samples in, a BundledCalibration out (the calibrator the bundle ships).
- Drivers: train_bilstm_real.py / train_transformer_real.py fit the calibration on val and bake it before scoring test (so the test report's operating point is in calibrated units); the foundation calibrate_and_score does the same between threshold calibration and scoring, threading the calibrator through the scored detector. A val split with no matched detection ships uncalibrated rather than failing.
- Model cards: a per-class "Operating pt" column (the calibrated confidence cut) in the test table, plus a calibration section (temperature, conformal coverage, per-class ECE), shared by the torch and foundation cards. The IGSO row renders automatically.
- calibration.py: add format_reliability_curve, a committed-data-free text reliability diagram rendered straight from a bundle; documented in the benchmark protocol's calibration section.
- Tests cover the driver baking, card rendering, and the render helper.

Part of the v0.3 release-cut work for #109.
The per-class operating point is now persisted into the scoring JSON; assert it is present (was asserted absent under the v0.2 in-memory-only contract). Part of the v0.3 release-cut work for #109.
… code Pure formatting: reflow docstrings/strings and let ruff format wrap the new call sites; no behaviour change. Part of the v0.3 release-cut work for #109.
On a sparse / poorly-separated val split the BCE-optimal temperature can collapse toward the clamp bound and merely flatten confidence toward the base rate — monotonic, so recall/precision are unchanged, but it distorts the confidence column rather than calibrating it. Guard it: BundledCalibration.fit keeps the fitted temperature only when it reduces the pooled val ECE, otherwise it ships identity (T=1, raw confidence) — a detector that cannot be meaningfully calibrated emits its raw confidence. Also clamp the exponent in the internal sigmoid so a near-separable fit no longer raises a benign exp-overflow RuntimeWarning (the sigmoid has already saturated; the value is unchanged). Tests: identity fallback on a perfectly-calibrated separable sample, and the do-no-harm invariant (shipped ECE never exceeds raw). Part of the v0.3 release-cut work for #109.
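The do-no-harm guard described in the commit above can be sketched in plain Python. Everything here is an assumption for illustration: the grid search, the equal-width 10-bin ECE, the exponent clamp at 60, and the name `fit_with_guard` are not taken from `BundledCalibration.fit`, whose optimiser and binning are not shown on this page.

```python
import math


def _sigmoid(x: float) -> float:
    # Clamp the exponent: beyond +/-60 the sigmoid has already saturated.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def _logit(p: float) -> float:
    # Clamp only exact 0/1 to the nearest interior floats.
    p = min(max(p, math.nextafter(0.0, 1.0)), math.nextafter(1.0, 0.0))
    return math.log(p) - math.log1p(-p)


def ece(conf, labels, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, labels):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    err = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            mean_hit = sum(y for _, y in b) / len(b)
            err += len(b) / len(conf) * abs(mean_conf - mean_hit)
    return err


def bce(conf, labels):
    """Mean binary cross-entropy, with probabilities kept off 0/1."""
    total = 0.0
    for c, y in zip(conf, labels):
        c = min(max(c, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(c) + (1 - y) * math.log(1.0 - c)
    return total / len(conf)


def fit_with_guard(conf, labels):
    """Fit a temperature by BCE; keep it only if it lowers the pooled ECE."""
    logits = [_logit(c) for c in conf]
    grid = [0.05 * k for k in range(1, 201)]  # candidate T in (0, 10]
    t = min(grid, key=lambda t: bce([_sigmoid(z / t) for z in logits], labels))
    calibrated = [_sigmoid(z / t) for z in logits]
    if ece(calibrated, labels) < ece(conf, labels):
        return t, calibrated
    return 1.0, list(conf)  # identity fallback: ship raw confidence
```

Because the fallback returns the raw confidences untouched, the shipped ECE can never exceed the raw ECE, which is the invariant the commit's tests assert.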
…ollapse)

The benchmark ranks detections by confidence, so a monotonic calibration must leave recall and precision unchanged. It didn't: the logit clamp pinned every confidence above 1 - 1e-6 to the same value, collapsing distinct *saturating* confidences into a tie. The forecast-residual detector's confidence is 1 - exp(-z/threshold), which exceeds that bound for any strong detection, so its strong true positives and false alarms tied. The scorer's deterministic false-alarm-first tie-break then admitted the tied false alarms first, exhausting a small class's false-alarm budget before any true positive and zeroing its recall (observed: foundation LEO 0.49 -> 0.00).

Clamp the logit only at exactly 0/1 (where it is infinite), via nextafter, so every distinct interior confidence keeps a distinct logit and calibration never introduces a benchmark tie that raw confidence did not already have.

Regression tests: recall is invariant to a monotonic calibration even with saturating, tied-prone confidences (reproduces the collapse under the old clamp), plus order-preservation at the unit level.

Part of the v0.3 release-cut work for #109.
Summary
Lands the committable code half of the v0.3 release-cut work, so the credentialed retrains and republish at the cut only have to produce the numbers, with no code left to write. Four pieces:
- `ScoreReport.to_json` now persists the per-class `operating_point_confidence` (the confidence cut at the headline false-alarm rate) as an additive field; it was already computed on `ClassMetrics` in memory. The committed scorer golden is regenerated (purely additive), a decision record D17 is added, and a regression test locks that the public leaderboard ignores the new field.
- The training drivers and the leaderboard fixture builder track the bundled catalogue version (currently `dataset/v0.3`, stamping `0.3.0`) instead of hard-coding v0.2, and the drivers default the selection objective to `macro`.
- `BundledCalibration` (val-fit temperature + conformal predictor + per-orbit-class reliability/ECE) is stored in both bundle types and applied at inference, so every published detector emits calibrated confidence with no calibration data at load. Old bundles without it load unchanged.
- The offline drivers fit the calibration on val, and the model cards and benchmark protocol document it, with a `format_reliability_curve` render helper.

Out of scope: manual release-cut steps (you run these)
The credentialed runs can't run here (no GPU; checkpoints are gitignored, so they produce no git diff). At the release cut, on a GPU box with Space-Track + `HF_TOKEN`:

- Retrain `bilstm-base` / `transformer-base` on v0.3 (`examples/train_*_real.py`, now `macro` + calibrated).
- `calibrate-foundation` for `chronos-residual` on v0.3.
- Rebuild the leaderboard fixture (`leaderboard/build_fixture.py`).

Per-class foundation-gate refinement, dataset/label changes, and HEO remain out of scope.
Test plan
- `uv run pytest`: full suite green (695 passed)
- `uv run ruff check` / `uv run ruff format --check`
- `uv run mypy`

Calibration wiring is exercised with synthetic / stand-in inputs only; the real per-class numbers (including IGSO) come from the credentialed runs above.
Closes #109