UMA: deprecate 1.0 checkpoints, back-fill model_id for 1.1 #2049

Open
misko wants to merge 1 commit into main from uma/deprecate-1p0-backfill-1p1-model-id

misko commented Jun 23, 2026


Summary

UMA checkpoint generations are tagged inconsistently in their model_config:

| version | backbone.model_version | top-level model_id |
| --- | --- | --- |
| UMA 1.0 | absent | absent |
| UMA 1.1 | "1.1" | absent |
| UMA 1.2 | absent | "UMA-S-1.2" |

Two consequences this PR fixes:

  1. UMA 1.0 still loads silently (by path or finetune-derived checkpoint) despite being dropped from pretrained_models.json. UMA 1.0 has a known semantic divergence: eSCNMDMoeBackbone.set_MOLE_coefficients branches on np.isclose(self.model_version, 1.0) (src/fairchem/core/models/uma/escn_moe.py:149), so loading 1.0 weights with current code produces numerically different results than the original release.
  2. UMA 1.1 has no model_id, so HydraModel.model_id is None for 1.1 (vs "UMA-S-1.2" for 1.2). Downstream code that wants to dispatch on model.model_id is blind.

This PR adds a small UMA-owned compat module that classifies the checkpoint by major.minor and applies in-place fixups at load time:

  • UMA 1.0 → raises RuntimeError with a path-specific message recommending pip install 'fairchem-core<=2.21.0'.
  • UMA 1.1 → back-fills model_id = "UMA-1.1" (warning-logged). The back-fill propagates into any subsequent finetune checkpoint automatically via the existing MLIPTrainEvalUnit.save_state → convert_train_checkpoint_to_inference_checkpoint chain.
  • UMA 1.2 → no-op.
  • Non-UMA → no-op.

The classifier prefers backbone.model_version over model_id, so a finetuned-from-1.1 checkpoint (which now carries model_id="UMA-1.1" after its first load) is correctly re-classified as 1.1 on subsequent loads. Idempotency is tested.
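
For reviewers, the classify-then-fixup flow described above can be sketched roughly as follows. This is a minimal illustration on plain dicts, not the contents of compat.py: the backbone marker string, the regexes, the logger name, the exact error text, and the choice to back-fill only when model_id is empty are all assumptions made for the sketch. Only the function names, the return labels, the "UMA-1.1" back-fill value, and the pip hint come from this PR.

```python
import logging
import re

logger = logging.getLogger("uma_compat_sketch")  # hypothetical logger name

# Assumption: a UMA checkpoint is recognised by this substring in backbone["model"].
UMA_BACKBONE_MARKER = "eSCNMDMoeBackbone"
KNOWN_VERSIONS = {"1.0", "1.1", "1.2"}


def get_uma_version(model_config):
    """Return "1.0", "1.1", "1.2", "unknown_uma", or "not_uma"."""
    backbone = model_config.get("backbone") or {}
    if UMA_BACKBONE_MARKER not in str(backbone.get("model", "")):
        return "not_uma"
    raw_version = backbone.get("model_version")
    model_id = str(model_config.get("model_id") or "").strip()
    if raw_version is not None:
        # backbone.model_version takes precedence over model_id.
        found = re.match(r"\s*(\d+)\.(\d+)", str(raw_version))
    elif model_id:
        # Fall back to the trailing major.minor of the id, e.g. "UMA-S-1.2".
        found = re.search(r"(\d+)[.p](\d+)$", model_id)
    else:
        # Neither tag present: the UMA 1.0 signature from the table above.
        return "1.0"
    if found is None:
        return "unknown_uma"
    version = f"{found.group(1)}.{found.group(2)}"
    return version if version in KNOWN_VERSIONS else "unknown_uma"


def apply_uma_compat_fixups(model_config, checkpoint_path="<unknown>"):
    """Mutate model_config in place; return the classified version."""
    version = get_uma_version(model_config)
    if version == "1.0":
        raise RuntimeError(
            f"{checkpoint_path}: UMA 1.0 checkpoints are not supported by this "
            "release; use pip install 'fairchem-core<=2.21.0'."
        )
    has_model_id = bool(str(model_config.get("model_id") or "").strip())
    if version == "1.1" and not has_model_id:
        model_config["model_id"] = "UMA-1.1"
        logger.warning("Back-filled model_id='UMA-1.1' for %s", checkpoint_path)
    return version
```

Because the version is read from backbone.model_version first, a config that already carries the back-filled model_id classifies the same way on a second pass, and the second call changes nothing.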

Sub-size (S / M / L) discrimination is intentionally out of scope — inside a single major.minor release, UMA variants are similar enough that downstream code does not need to dispatch on size.

The fixup is wired at three I/O boundaries (one would not be enough; see MLIPPredictUnit.__init__ which does its own torch.load and reads the config before delegating to load_inference_model):

  1. fairchem.core.units.mlip_unit.utils.load_inference_model (safety net)
  2. fairchem.core.units.mlip_unit.predict.MLIPPredictUnit.__init__ — immediately after torch.load, before maybe_update_settings_backend. UMA 1.0 raises before the ~1 GB tensor allocation.
  3. fairchem.core.units.mlip_unit.mlip_unit.convert_train_checkpoint_to_inference_checkpoint — between torch.load and torch.save, so finetune-derived inference checkpoints get tagged on disk.
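
The ordering at each boundary (deserialize, run the fixup, then do anything expensive) can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the wiring in predict.py: `load_with_compat` is a hypothetical helper, `pickle.load` stands in for `torch.load`, the checkpoint is modelled as a dict with a `"model_config"` key, and `fixup` / `build_model` are caller-supplied callables.

```python
import pickle
from typing import Any, Callable


def load_with_compat(
    checkpoint_path: str,
    fixup: Callable[[dict, str], Any],
    build_model: Callable[[dict], Any],
) -> Any:
    """Load a checkpoint, apply compat fixups, then build the model."""
    # Stand-in for torch.load: deserialize the checkpoint from disk.
    with open(checkpoint_path, "rb") as handle:
        checkpoint = pickle.load(handle)
    model_config = checkpoint["model_config"]
    # The fixup runs first, so an unsupported checkpoint raises here and
    # build_model (the expensive step in this sketch) is never reached.
    fixup(model_config, checkpoint_path)
    # Only a config that passed the fixup is used to build the model.
    return build_model(model_config)
```

Placing the call directly after deserialization also means any code that reads the config afterwards sees the back-filled model_id.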

Files changed

  • New: src/fairchem/core/models/uma/compat.py, providing get_uma_version() (public) and apply_uma_compat_fixups() (public). Module docstring covers call sites, override-bypass policy, idempotency, and known DCP-resume / load_tasks narrow gaps.
  • New: tests/core/models/uma/test_compat.py — 22 unit tests (no GPU, no network).
  • Modified: src/fairchem/core/units/mlip_unit/{utils.py,predict.py,mlip_unit.py} — wire the fixup at the three I/O sites.
  • Modified: tests/core/units/mlip_unit/test_predict.py — 3 integration tests: test_uma_1p1_predict_unit_has_model_id, test_uma_1p1_finetune_propagates_model_id, test_uma_1p0_predict_unit_raises (skip-conditional on UMA_1P0_PATH env var or ~/.cache/fairchem cache).
  • Modified: docs/core/uma_changelog.md — compatibility section.
  • Modified: docs/uma_tutorials/uma_tutorial.ipynb (retarget 9 cells from "uma-s-1" → "uma-s-1p2").
  • Modified: src/fairchem/applications/cattsunami/DATASET.md — same retarget.

Test plan

  • pytest tests/core/models/uma/test_compat.py — 22/22 pass (no GPU, no network).
  • pytest tests/core/units/mlip_unit/test_predict.py::test_uma_1p1_predict_unit_has_model_id tests/core/units/mlip_unit/test_predict.py::test_uma_1p1_finetune_propagates_model_id tests/core/units/mlip_unit/test_predict.py::test_uma_1p0_predict_unit_raises — 3/3 pass.
  • End-to-end against real cached checkpoints (uma-s-1.pt, uma-s-1p1.pt, uma-m-1p1.pt, uma-s-1p2.pt): 1.0 raises with full message, 1.1 small + medium back-fill to "UMA-1.1", 1.2 unchanged.
  • Finetune-propagation: initialize_finetuning_model(uma-s-1p1.pt) → model.finetune_model_full_config["model_id"] == "UMA-1.1" (and same for medium).
  • Non-UMA hydra smoke test: esen-style backbone → not_uma, zero mutations, zero logs.
  • CI green (will run on push).

Notes / out of scope

  • No deprecation cycle for UMA 1.0 — hard fail because silent loads produce wrong numbers.
  • No env-var escape hatch.
  • No re-saving of shipped UMA 1.1 checkpoints — fixup is in-memory at load time.
  • MLIPTrainEvalUnit._execute_load_state (DCP train-resume) and load_tasks are documented narrow gaps; not instrumented here.

UMA checkpoint generations are inconsistently tagged: 1.0 has neither
`backbone.model_version` nor top-level `model_id`; 1.1 carries
`backbone.model_version="1.1"` but no `model_id`; 1.2 has
`model_id="UMA-S-1.2"` and no `model_version`. This change introduces
a single source of truth for UMA version identification and applies
in-place fixups at load time.

- New `fairchem.core.models.uma.compat` module exposes
  `get_uma_version()` (public classifier returning "1.0"/"1.1"/"1.2"/
  "unknown_uma"/"not_uma") and `apply_uma_compat_fixups()` (raises
  RuntimeError for 1.0, back-fills `model_id="UMA-1.1"` for 1.1).
- UMA 1.0 hard-fails with a path-specific RuntimeError pointing at
  `pip install 'fairchem-core<=2.21.0'`. Justification: the
  `eSCNMDMoeBackbone` composition-reduction `include_self` flag
  branches on `np.isclose(model_version, 1.0)`, so silent loads
  produce numerically different results.
- Fixup wired at three I/O boundaries (idempotent):
  `load_inference_model`, `MLIPPredictUnit.__init__` (before
  `maybe_update_settings_backend`), and
  `convert_train_checkpoint_to_inference_checkpoint`.
- Classifier prefers `backbone.model_version` over `model_id` so
  finetuned-from-1.1 checkpoints (which carry the back-filled
  `model_id` after the first load) are correctly re-classified as 1.1.
- Sub-size (S/M/L) is intentionally not encoded; back-fill is the bare
  "UMA-1.1".
- Tests: 22 unit tests in `tests/core/models/uma/test_compat.py`
  (DictConfig + struct-mode, registry short-name, empty/whitespace
  `model_id`, idempotency with S/M/L-suffixed forms, etc.) plus 3
  integration tests in `tests/core/units/mlip_unit/test_predict.py`
  (1.1 predict has model_id, finetune propagates the back-fill, 1.0
  raises).
- Docs/notebook sweep: retarget 10 `"uma-s-1"` references in the UMA
  tutorial + cattsunami DATASET.md to `"uma-s-1p2"`; add a
  compatibility section to `docs/core/uma_changelog.md`.
@meta-cla meta-cla Bot added the cla signed label Jun 23, 2026
@wood-b wood-b self-requested a review June 25, 2026 14:02