Retrain + re-eval baselines on v0.3 (IGSO): protocol bump + baked-in calibration #112

Merged: djankov merged 8 commits into main from issue-109-reeval-baselines-v03-igso on Jun 15, 2026
@djankov djankov commented Jun 15, 2026

Summary

Lands the committable code half of the v0.3 release-cut work, so that the credentialed retrain and republish runs at the cut only have to produce the numbers, with no code left to write. Four pieces:

  • Score-JSON protocol bump (D17). ScoreReport.to_json now persists the per-class operating_point_confidence (the confidence cut at the headline false-alarm rate) as an additive field; it was already computed on ClassMetrics in-memory. The committed scorer golden is regenerated (purely additive), a decision record D17 is added, and a regression test locks that the public leaderboard ignores the new field.
  • v0.3 repoints. The learned-baseline training drivers and the leaderboard fixture builder now track the bundled catalogue version (currently dataset/v0.3, stamping 0.3.0) instead of hard-coding v0.2, and default the selection objective to macro.
  • Baked-in uncertainty calibration. A new BundledCalibration (val-fit temperature + conformal predictor + per-orbit-class reliability/ECE) is stored in both bundle types and applied at inference, so every published detector emits calibrated confidence with no calibration data at load. Old bundles without it load unchanged.
  • Publish it. The offline drivers fit the calibration on the val split only (no test leakage) and bake it into the bundle before scoring test; the model cards gain a calibrated per-class operating-point column and a calibration section (the IGSO row renders automatically); the benchmark docs describe it, along with a format_reliability_curve render helper that needs no committed data.

Out of scope — manual release-cut steps (you run these)

The credentialed runs can't run here (no GPU; checkpoints are gitignored, so they produce no git diff). At the release cut, on a GPU box with Space-Track + HF_TOKEN:

  • Retrain bilstm-base / transformer-base on v0.3 (examples/train_*_real.py, now macro + calibrated).
  • Re-run calibrate-foundation for chronos-residual on v0.3.
  • Republish the HF checkpoints/bundles + regenerated cards (IGSO row, calibrated confidence), lockstep with the v0.3 tag.
  • Re-seed the public leaderboard (leaderboard/build_fixture.py).

Per-class foundation-gate refinement, dataset/label changes, and HEO remain out of scope.

Test plan

  • uv run pytest — full suite green (695 passed)
  • uv run ruff check / uv run ruff format --check
  • uv run mypy
  • Credentialed retrains + foundation re-calibrate + HF republish + leaderboard re-seed (manual, at the release cut)

Calibration wiring is exercised with synthetic / stand-in inputs only; the real per-class numbers (including IGSO) come from the credentialed runs above.

Closes #109

djankov added 8 commits June 15, 2026 13:51
…ocol bump)

Add an additive ``operating_point_confidence`` to ``ScoreReport.to_json``'s per-class
payload — the confidence cut admitted within the false-alarm budget at the headline
operating point, the per-class operating point an uncertainty-calibration pass publishes.
It was already computed on ``ClassMetrics`` in-memory (the v0.2 report kept it off the
serialised artifact); persisting it changes the frozen golden, so it is a v0.3-boundary
change recorded as decision D17.

- benchmark/scoring.py: serialise the field; benchmark/metrics.py: refresh the now-stale
  "in-memory only" docstring.
- Regenerate the committed scorer golden (tests/data/benchmark/scores.json) — purely
  additive, every prior field byte-for-byte unchanged.
- Lock the leaderboard tolerance: the Space re-scores live and its public response is a
  strict aggregate subset, so the new field is inert there (new regression test).
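
A minimal sketch of why an additive per-class field stays inert for a consumer that projects onto a fixed subset. Only `operating_point_confidence` comes from this change; the other field names, `LEADERBOARD_FIELDS`, and `leaderboard_view` are hypothetical stand-ins, not the real scorer or Space code.

```python
# Hypothetical per-class payload; only operating_point_confidence is from the PR.
def class_payload(recall, precision, operating_point_confidence):
    return {
        "recall": recall,
        "precision": precision,
        "operating_point_confidence": operating_point_confidence,  # new, additive
    }


# Hypothetical allow-list standing in for the leaderboard's aggregate subset.
LEADERBOARD_FIELDS = ("recall", "precision")


def leaderboard_view(report):
    """Project a score report onto the allow-listed fields only."""
    return {
        cls: {key: payload[key] for key in LEADERBOARD_FIELDS}
        for cls, payload in report["per_class"].items()
    }


old_report = {"per_class": {"LEO": {"recall": 0.5, "precision": 0.8}}}
new_report = {"per_class": {"LEO": class_payload(0.5, 0.8, 0.93)}}

# The public view is identical with or without the new field.
assert leaderboard_view(old_report) == leaderboard_view(new_report)
```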

Part of the v0.3 release-cut work for #109.

The learned-baseline training drivers and the leaderboard fixture builder still pointed at
dataset/v0.2 and stamped dataset_version 0.2.0, predating the IGSO growth. Track the bundled
catalogue version instead of hard-coding the path/stamp, so the committed labels/splits and
the reconstruction recipe stay in lockstep (a future catalogue bump repoints them without a
driver edit).

- train_bilstm_real.py, train_transformer_real.py: derive _DATA from DATASET_VERSION
  (currently v0.3), stamp the bundle's dataset_version from it, and default the selection
  objective to `macro` (the class-balanced retrain the v0.3 baselines run under).
- build_fixture.py: derive _DATA from DATASET_VERSION the same way.
- De-peg the version numbers from docstrings/prints so they don't rot at the next bump.

Part of the v0.3 release-cut work for #109.

Wire the uncertainty-calibration machinery into the published artifacts so every published
detector emits calibrated confidence with no calibration data at load.

- calibration.py: add BundledCalibration — the serialisable, val-fit calibration baked into a
  bundle (a pooled temperature, the conformal predictor, and per-orbit-class reliability +
  ECE). `fit` pools the per-class val samples for the calibrator and measures reliability/ECE
  on the calibrated confidences; an empty (sparse) class rides through with a zero ECE. Add
  an `apply_calibration` helper and route CalibratedDetector through it.
- checkpoint.py / foundation.py: add an optional `calibration` slot to ModelBundle and
  FoundationBundle, round-tripped through save/load; a bundle saved before the slot loads as
  None (back-compatible).
- learned.py / detectors/foundation.py: adopt the bundle's calibrator on load and remap the
  emitted confidence in detect() when one is present.
- Tests: BundledCalibration fit/round-trip/sparse-class, bundle save/load round-trip +
  back-compat, and end-to-end calibrated inference for both detector families.
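
The back-compatible slot can be sketched with a toy bundle. `ToyBundle` and `ToyCalibration` are hypothetical stand-ins for ModelBundle / FoundationBundle and BundledCalibration, and the dict payload is an assumed serialisation, not the real on-disk format.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyCalibration:
    temperature: float


@dataclass
class ToyBundle:
    weights: list
    calibration: Optional[ToyCalibration] = None  # optional slot

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, payload: dict) -> "ToyBundle":
        # A payload saved before the slot existed has no "calibration" key.
        raw = payload.get("calibration")
        calibration = ToyCalibration(**raw) if raw is not None else None
        return cls(weights=payload["weights"], calibration=calibration)


new_bundle = ToyBundle(weights=[0.1, 0.2], calibration=ToyCalibration(1.5))
legacy_payload = {"weights": [0.1, 0.2]}  # written by an older version

assert ToyBundle.from_dict(new_bundle.to_dict()) == new_bundle
assert ToyBundle.from_dict(legacy_payload).calibration is None
```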

Part of the v0.3 release-cut work for #109.

Apply the baked-in calibration end to end: fit it in the offline drivers, render it on the
model cards, and document it in the benchmark protocol.

- evaluate.py: add fit_calibration_on_val — the per-class val samples in, a BundledCalibration
  out (the calibrator the bundle ships).
- Drivers: train_bilstm_real.py / train_transformer_real.py fit the calibration on val and
  bake it before scoring test (so the test report's operating point is in calibrated units);
  the foundation calibrate_and_score does the same between threshold calibration and scoring,
  threading the calibrator through the scored detector. A val split with no matched detection
  ships uncalibrated rather than failing.
- Model cards: a per-class "Operating pt" column (the calibrated confidence cut) in the test
  table, plus a calibration section (temperature, conformal coverage, per-class ECE), shared by
  the torch and foundation cards. The IGSO row renders automatically.
- calibration.py: add format_reliability_curve — a committed-data-free text reliability diagram
  rendered straight from a bundle; documented in the benchmark protocol's calibration section.
- Tests cover the driver baking, card rendering, and the render helper.
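
For orientation, a self-contained sketch of a text reliability diagram and the ECE it summarises. This is not `format_reliability_curve`; the function names, equal-width binning, and output layout are assumptions chosen for illustration.

```python
def reliability_bins(confidences, hits, n_bins=5):
    """Return (count, mean_confidence, hit_rate) for each equal-width bin."""
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, hits):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        sums[index][0] += 1
        sums[index][1] += conf
        sums[index][2] += hit
    return [(n, s / n, k / n) if n else (0, 0.0, 0.0) for n, s, k in sums]


def expected_calibration_error(bins):
    """Count-weighted mean gap between confidence and hit rate."""
    total = sum(n for n, _, _ in bins)
    if total == 0:
        return 0.0  # an empty class contributes a zero ECE
    return sum(n * abs(conf - acc) for n, conf, acc in bins) / total


def render(bins, width=20):
    """One text row per bin, with a bar proportional to the observed hit rate."""
    rows = []
    for i, (n, conf, acc) in enumerate(bins):
        lo, hi = i / len(bins), (i + 1) / len(bins)
        bar = "#" * round(acc * width)
        rows.append(f"[{lo:.1f}, {hi:.1f}) n={n:<3d} conf={conf:.2f} acc={acc:.2f} {bar}")
    return "\n".join(rows)


bins = reliability_bins([0.1, 0.9, 0.9, 0.9], [0, 1, 1, 0], n_bins=2)
print(render(bins))
print(f"ECE = {expected_calibration_error(bins):.3f}")
```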

Part of the v0.3 release-cut work for #109.

The per-class operating point is now persisted into the scoring JSON; assert it is present
(was asserted absent under the v0.2 in-memory-only contract).

Part of the v0.3 release-cut work for #109.

… code

Pure formatting: reflow docstrings/strings and let ruff format wrap the new call sites; no
behaviour change.

Part of the v0.3 release-cut work for #109.

On a sparse / poorly-separated val split the BCE-optimal temperature can collapse toward the
clamp bound and merely flatten confidence toward the base rate — monotonic, so recall/precision
are unchanged, but it distorts the confidence column rather than calibrating it. Guard it:
BundledCalibration.fit keeps the fitted temperature only when it reduces the pooled val ECE,
otherwise it ships identity (T=1, raw confidence) — a detector that cannot be meaningfully
calibrated emits its raw confidence.

Also clamp the exponent in the internal sigmoid so a near-separable fit no longer raises a
benign exp-overflow RuntimeWarning (the sigmoid has already saturated; the value is unchanged).

Tests: identity fallback on a perfectly-calibrated separable sample, and the do-no-harm
invariant (shipped ECE never exceeds raw).
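
The guard can be sketched as follows. The grid search, its range, the bin count, and every function name here are assumptions made for illustration; BundledCalibration.fit may optimise the temperature differently. The sketch only shows the keep-if-ECE-improves rule and the clamped sigmoid exponent.

```python
import math

_LO = math.nextafter(0.0, 1.0)  # smallest positive float
_HI = math.nextafter(1.0, 0.0)  # largest float below 1
GRID = [0.25 * k for k in range(1, 41)]  # hypothetical search range, 0.25 .. 10.0


def logit(p):
    p = min(max(p, _LO), _HI)  # only exact 0 and 1 are moved
    return math.log(p / (1.0 - p))


def sigmoid(x):
    x = max(-700.0, min(700.0, x))  # keep exp() in range; the output is saturated here
    return 1.0 / (1.0 + math.exp(-x))


def calibrate(conf, t):
    """Temperature-scale confidences; T == 1 returns them untouched."""
    if t == 1.0:
        return list(conf)
    return [sigmoid(logit(c) / t) for c in conf]


def bce(conf, hits, eps=1e-12):
    total = 0.0
    for c, h in zip(conf, hits):
        c = min(max(c, eps), 1.0 - eps)  # avoid log(0)
        total -= h * math.log(c) + (1 - h) * math.log(1.0 - c)
    return total / len(conf)


def ece(conf, hits, n_bins=10):
    bins = [[0.0, 0.0] for _ in range(n_bins)]  # [sum of confidence, sum of hits]
    for c, h in zip(conf, hits):
        b = bins[min(int(c * n_bins), n_bins - 1)]
        b[0] += c
        b[1] += h
    return sum(abs(s - k) for s, k in bins) / len(conf)


def fit_guarded_temperature(conf, hits):
    best = min(GRID, key=lambda t: bce(calibrate(conf, t), hits))
    if ece(calibrate(conf, best), hits) < ece(conf, hits):
        return best
    return 1.0  # identity: ship raw confidence


# A perfectly calibrated, separable sample cannot be improved, so T stays at 1.
assert fit_guarded_temperature([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0]) == 1.0
```

Because the fitted temperature is kept only on a strict ECE reduction, the shipped ECE in this sketch can never exceed the raw one.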

Part of the v0.3 release-cut work for #109.

…ollapse)

The benchmark ranks detections by confidence, so a monotonic calibration must leave recall
and precision unchanged. It didn't: the logit clamp pinned every confidence above 1 - 1e-6 to
the same value, collapsing distinct *saturating* confidences into a tie. The forecast-residual
detector's confidence is 1 - exp(-z/threshold), which exceeds 1 - 1e-6 for any strong detection,
so its strong true positives and false alarms tied. The scorer's deterministic
false-alarm-first tie-break then admitted the tied false alarms first, exhausting a small class's
false-alarm budget before any true positive was admitted and zeroing its recall (observed:
foundation LEO 0.49 -> 0.00).

Clamp the logit only at exactly 0/1 (where it is infinite), via nextafter, so every distinct
interior confidence keeps a distinct logit and calibration never introduces a benchmark tie that
raw confidence did not already have.

Regression tests: recall is invariant to a monotonic calibration even with saturating, tied-prone
confidences (reproduces the collapse under the old clamp), plus order-preservation at the unit
level.
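
The collapse and the fix can be reproduced in a few lines. `logit_old` and `logit_new` are illustrative reconstructions from the description above, not the repository's functions, and the z values and unit threshold are arbitrary choices for a "strong" detection.

```python
import math


def logit_old(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # pins everything above 1 - 1e-6 to one value
    return math.log(p / (1.0 - p))


def logit_new(p):
    lo = math.nextafter(0.0, 1.0)  # smallest positive float
    hi = math.nextafter(1.0, 0.0)  # largest float below 1
    p = min(max(p, lo), hi)        # only exact 0 and 1 are moved
    return math.log(p / (1.0 - p))


def residual_confidence(z, threshold=1.0):
    """Forecast-residual confidence, 1 - exp(-z / threshold)."""
    return 1.0 - math.exp(-z / threshold)


# Three distinct strong detections, all above 1 - 1e-6.
strong = [residual_confidence(z) for z in (15.0, 20.0, 25.0)]

old = [logit_old(c) for c in strong]
new = [logit_new(c) for c in strong]

assert len(set(old)) == 1        # old clamp: distinct confidences tie
assert new[0] < new[1] < new[2]  # new clamp: ranking order survives
assert math.isfinite(logit_new(0.0)) and math.isfinite(logit_new(1.0))
```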

Part of the v0.3 release-cut work for #109.
@djankov djankov marked this pull request as ready for review June 15, 2026 22:27
@djankov djankov merged commit 4440a73 into main Jun 15, 2026
25 of 26 checks passed
@djankov djankov deleted the issue-109-reeval-baselines-v03-igso branch June 15, 2026 22:27