UMA: deprecate 1.0 checkpoints, back-fill model_id for 1.1 by misko · Pull Request #2049 · facebookresearch/fairchem

misko · 2026-06-23T03:45:46Z

Summary

UMA checkpoint generations are tagged inconsistently in their model_config:

| version | `backbone.model_version` | top-level `model_id` |
| --- | --- | --- |
| UMA 1.0 | absent | absent |
| UMA 1.1 | `"1.1"` | absent |
| UMA 1.2 | absent | `"UMA-S-1.2"` |

Two consequences this PR fixes:

UMA 1.0 still loads silently (by path or finetune-derived checkpoint) despite being dropped from pretrained_models.json. UMA 1.0 has a known semantic divergence: eSCNMDMoeBackbone.set_MOLE_coefficients branches on np.isclose(self.model_version, 1.0) (src/fairchem/core/models/uma/escn_moe.py:149), so loading 1.0 weights with current code produces numerically different results than the original release.
UMA 1.1 has no model_id, so HydraModel.model_id is None for 1.1 (vs "UMA-S-1.2" for 1.2). Downstream code that wants to dispatch on model.model_id is blind.

This PR adds a small UMA-owned compat module that classifies the checkpoint by major.minor and applies in-place fixups at load time:

UMA 1.0 → raises RuntimeError with a path-specific message recommending pip install 'fairchem-core<=2.21.0'.
UMA 1.1 → back-fills model_id = "UMA-1.1" (warning-logged). The back-fill propagates into any subsequent finetune checkpoint automatically via the existing MLIPTrainEvalUnit.save_state → convert_train_checkpoint_to_inference_checkpoint chain.
UMA 1.2 → no-op.
Non-UMA → no-op.
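
The four cases above can be sketched as one dispatch on an already-classified version label. This is a minimal illustration under stated assumptions, not the code in `compat.py`: the function name `apply_fixups_sketch`, the plain-dict config, the log and error wording, and the choice to back-fill only when `model_id` is missing or blank are all hypothetical. Only the version labels, the `"UMA-1.1"` value, and the pip hint come from this description.

```python
import logging

logger = logging.getLogger(__name__)


def apply_fixups_sketch(model_config: dict, version: str, path: str) -> dict:
    """Apply load-time fixups in place for an already-classified checkpoint.

    Hypothetical helper. `version` is one of the labels named in this PR:
    "1.0", "1.1", "1.2", "unknown_uma", or "not_uma".
    """
    if version == "1.0":
        # Hard fail: current code would produce different numbers for 1.0 weights.
        raise RuntimeError(
            f"UMA 1.0 checkpoint at {path} is no longer supported; "
            "to load it, pip install 'fairchem-core<=2.21.0'."
        )
    if version == "1.1":
        # Assumption: only fill in model_id when it is absent or blank,
        # so a second call on the same config changes nothing.
        current = str(model_config.get("model_id") or "").strip()
        if not current:
            logger.warning("Back-filling model_id='UMA-1.1' for %s", path)
            model_config["model_id"] = "UMA-1.1"
    # Every other label is left untouched in this sketch.
    return model_config
```

Because the 1.1 branch checks for an existing value first, calling the helper twice on the same config leaves it unchanged after the first call, which is the idempotency property the PR tests for.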

The classifier prefers backbone.model_version over model_id, so a finetuned-from-1.1 checkpoint (which now carries model_id="UMA-1.1" after its first load) is correctly re-classified as 1.1 on subsequent loads. Idempotency is tested.
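
The precedence rule can be illustrated with a small classifier over a plain dict. This is a sketch under assumptions rather than the real `get_uma_version()`: the name `classify_sketch`, the regular expression for `model_id`, and the explicit `is_uma` argument are hypothetical (this description does not say how UMA 1.0 is told apart from a non-UMA model, so the sketch takes that decision as an input).

```python
import re

# Assumed model_id shapes: "UMA-1.1" or "UMA-S-1.2" (optional S/M/L size).
_MODEL_ID_RE = re.compile(r"^UMA(?:-[SML])?-(\d+)\.(\d+)$", re.IGNORECASE)
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)")
_KNOWN = {"1.0", "1.1", "1.2"}


def classify_sketch(model_config: dict, is_uma: bool = True) -> str:
    """Return "1.0", "1.1", "1.2", "unknown_uma", or "not_uma"."""
    if not is_uma:
        return "not_uma"

    backbone = model_config.get("backbone") or {}
    model_version = backbone.get("model_version")

    if model_version is not None:
        # backbone.model_version wins, even if model_id is also present.
        match = _VERSION_RE.match(str(model_version).strip())
    else:
        model_id = str(model_config.get("model_id") or "").strip()
        if not model_id:
            # Neither tag present: this is the UMA 1.0 row of the table.
            return "1.0"
        match = _MODEL_ID_RE.match(model_id)

    if match is None:
        return "unknown_uma"
    label = f"{match.group(1)}.{match.group(2)}"
    return label if label in _KNOWN else "unknown_uma"
```

With this ordering, a config that carries both `backbone.model_version="1.1"` and a back-filled `model_id="UMA-1.1"` still classifies as `"1.1"`, so a checkpoint finetuned from 1.1 keeps its label on every later load.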

Sub-size (S / M / L) discrimination is intentionally out of scope — inside a single major.minor release, UMA variants are similar enough that downstream code does not need to dispatch on size.

The fixup is wired at three I/O boundaries (one would not be enough; see MLIPPredictUnit.__init__ which does its own torch.load and reads the config before delegating to load_inference_model):

fairchem.core.units.mlip_unit.utils.load_inference_model (safety net)
fairchem.core.units.mlip_unit.predict.MLIPPredictUnit.__init__ — immediately after torch.load, before maybe_update_settings_backend. UMA 1.0 raises before the ~1 GB tensor allocation.
fairchem.core.units.mlip_unit.mlip_unit.convert_train_checkpoint_to_inference_checkpoint — between torch.load and torch.save, so finetune-derived inference checkpoints get tagged on disk.

Files changed

New: src/fairchem/core/models/uma/compat.py — get_uma_version() (public), apply_uma_compat_fixups() (public). Module docstring covers call sites, override-bypass policy, idempotency, and known DCP-resume / load_tasks narrow gaps.
New: tests/core/models/uma/test_compat.py — 22 unit tests (no GPU, no network).
Modified: src/fairchem/core/units/mlip_unit/{utils.py,predict.py,mlip_unit.py} — wire the fixup at the three I/O sites.
Modified: tests/core/units/mlip_unit/test_predict.py — 3 integration tests: test_uma_1p1_predict_unit_has_model_id, test_uma_1p1_finetune_propagates_model_id, test_uma_1p0_predict_unit_raises (skip-conditional on UMA_1P0_PATH env var or ~/.cache/fairchem cache).
Modified: docs/core/uma_changelog.md — compatibility section.
Modified: docs/uma_tutorials/uma_tutorial.ipynb — retarget 9 cells from "uma-s-1" → "uma-s-1p2".
Modified: src/fairchem/applications/cattsunami/DATASET.md — same retarget.

Test plan

pytest tests/core/models/uma/test_compat.py — 22/22 pass (no GPU, no network).
pytest tests/core/units/mlip_unit/test_predict.py::test_uma_1p1_predict_unit_has_model_id tests/core/units/mlip_unit/test_predict.py::test_uma_1p1_finetune_propagates_model_id tests/core/units/mlip_unit/test_predict.py::test_uma_1p0_predict_unit_raises — 3/3 pass.
End-to-end against real cached checkpoints (uma-s-1.pt, uma-s-1p1.pt, uma-m-1p1.pt, uma-s-1p2.pt): 1.0 raises with full message, 1.1 small + medium back-fill to "UMA-1.1", 1.2 unchanged.
Finetune-propagation: initialize_finetuning_model(uma-s-1p1.pt) → model.finetune_model_full_config["model_id"] == "UMA-1.1" (and same for medium).
Non-UMA hydra smoke test: esen-style backbone → not_uma, zero mutations, zero logs.
CI green (will run on push).

Notes / out of scope

No deprecation cycle for UMA 1.0 — hard fail because silent loads produce wrong numbers.
No env-var escape hatch.
No re-saving of shipped UMA 1.1 checkpoints — fixup is in-memory at load time.
MLIPTrainEvalUnit._execute_load_state (DCP train-resume) and load_tasks are documented narrow gaps; not instrumented here.

UMA checkpoint generations are inconsistently tagged: 1.0 has neither `backbone.model_version` nor top-level `model_id`; 1.1 carries `backbone.model_version="1.1"` but no `model_id`; 1.2 has `model_id="UMA-S-1.2"` and no `model_version`. This change introduces a single source of truth for UMA version identification and applies in-place fixups at load time.

- New `fairchem.core.models.uma.compat` module exposes `get_uma_version()` (public classifier returning "1.0"/"1.1"/"1.2"/"unknown_uma"/"not_uma") and `apply_uma_compat_fixups()` (raises RuntimeError for 1.0, back-fills `model_id="UMA-1.1"` for 1.1).
- UMA 1.0 hard-fails with a path-specific RuntimeError pointing at `pip install 'fairchem-core<=2.21.0'`. Justification: the `eSCNMDMoeBackbone` composition-reduction `include_self` flag branches on `np.isclose(model_version, 1.0)`, so silent loads produce numerically different results.
- Fixup wired at three I/O boundaries (idempotent): `load_inference_model`, `MLIPPredictUnit.__init__` (before `maybe_update_settings_backend`), and `convert_train_checkpoint_to_inference_checkpoint`.
- Classifier prefers `backbone.model_version` over `model_id` so finetuned-from-1.1 checkpoints (which carry the back-filled `model_id` after the first load) are correctly re-classified as 1.1.
- Sub-size (S/M/L) is intentionally not encoded; back-fill is the bare "UMA-1.1".
- Tests: 22 unit tests in `tests/core/models/uma/test_compat.py` (DictConfig + struct-mode, registry short-name, empty/whitespace `model_id`, idempotency with S/M/L-suffixed forms, etc.) plus 3 integration tests in `tests/core/units/mlip_unit/test_predict.py` (1.1 predict has model_id, finetune propagates the back-fill, 1.0 raises).
- Docs/notebook sweep: retarget 10 `"uma-s-1"` references in the UMA tutorial + cattsunami DATASET.md to `"uma-s-1p2"`; add a compatibility section to `docs/core/uma_changelog.md`.

meta-cla Bot added the cla signed label Jun 23, 2026

wood-b self-requested a review June 25, 2026 14:02

misko wants to merge 1 commit into main from uma/deprecate-1p0-backfill-1p1-model-id
