UMA: deprecate 1.0 checkpoints, back-fill model_id for 1.1 #2049
Open
misko wants to merge 1 commit into
UMA checkpoint generations are inconsistently tagged: 1.0 has neither `backbone.model_version` nor top-level `model_id`; 1.1 carries `backbone.model_version="1.1"` but no `model_id`; 1.2 has `model_id="UMA-S-1.2"` and no `model_version`. This change introduces a single source of truth for UMA version identification and applies in-place fixups at load time.

- New `fairchem.core.models.uma.compat` module exposes `get_uma_version()` (public classifier returning "1.0"/"1.1"/"1.2"/"unknown_uma"/"not_uma") and `apply_uma_compat_fixups()` (raises RuntimeError for 1.0, back-fills `model_id="UMA-1.1"` for 1.1).
- UMA 1.0 hard-fails with a path-specific RuntimeError pointing at `pip install 'fairchem-core<=2.21.0'`. Justification: the `eSCNMDMoeBackbone` composition-reduction `include_self` flag branches on `np.isclose(model_version, 1.0)`, so silent loads produce numerically different results.
- Fixup wired at three I/O boundaries (idempotent): `load_inference_model`, `MLIPPredictUnit.__init__` (before `maybe_update_settings_backend`), and `convert_train_checkpoint_to_inference_checkpoint`.
- Classifier prefers `backbone.model_version` over `model_id` so finetuned-from-1.1 checkpoints (which carry the back-filled `model_id` after the first load) are correctly re-classified as 1.1.
- Sub-size (S/M/L) is intentionally not encoded; back-fill is the bare "UMA-1.1".
- Tests: 22 unit tests in `tests/core/models/uma/test_compat.py` (DictConfig + struct-mode, registry short-name, empty/whitespace `model_id`, idempotency with S/M/L-suffixed forms, etc.) plus 3 integration tests in `tests/core/units/mlip_unit/test_predict.py` (1.1 predict has model_id, finetune propagates the back-fill, 1.0 raises).
- Docs/notebook sweep: retarget 10 `"uma-s-1"` references in the UMA tutorial + cattsunami DATASET.md to `"uma-s-1p2"`; add a compatibility section to `docs/core/uma_changelog.md`.
Summary
UMA checkpoint generations are tagged inconsistently in their `model_config`:

| Generation | `backbone.model_version` | `model_id` |
| --- | --- | --- |
| 1.0 | (absent) | (absent) |
| 1.1 | `"1.1"` | (absent) |
| 1.2 | (absent) | `"UMA-S-1.2"` |

Two consequences this PR fixes:

- [...] `pretrained_models.json`. UMA 1.0 has a known semantic divergence: `eSCNMDMoeBackbone.set_MOLE_coefficients` branches on `np.isclose(self.model_version, 1.0)` (`src/fairchem/core/models/uma/escn_moe.py:149`), so loading 1.0 weights with current code produces numerically different results than the original release.
- UMA 1.1 checkpoints carry no `model_id`, so `HydraModel.model_id` is `None` for 1.1 (vs `"UMA-S-1.2"` for 1.2). Downstream code that wants to dispatch on `model.model_id` is blind.

This PR adds a small UMA-owned compat module that classifies the checkpoint by major.minor and applies in-place fixups at load time:
- UMA 1.0: raises `RuntimeError` with a path-specific message recommending `pip install 'fairchem-core<=2.21.0'`.
- UMA 1.1: back-fills `model_id = "UMA-1.1"` (warning-logged). The back-fill propagates into any subsequent finetune checkpoint automatically via the existing `MLIPTrainEvalUnit.save_state` → `convert_train_checkpoint_to_inference_checkpoint` chain.

The classifier prefers `backbone.model_version` over `model_id`, so a finetuned-from-1.1 checkpoint (which now carries `model_id="UMA-1.1"` after its first load) is correctly re-classified as 1.1 on subsequent loads. Idempotency is tested.

Sub-size (S / M / L) discrimination is intentionally out of scope: within a single major.minor release, UMA variants are similar enough that downstream code does not need to dispatch on size.
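The classification precedence and the two fixups described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `compat.py`: the function names, the `UMA_BACKBONE_NAMES` marker used to decide whether a config is UMA at all, and the error and warning wording are all assumptions made for the example, and the config is modelled as a plain dict rather than a `DictConfig`.

```python
import logging
from typing import Any, Mapping, MutableMapping

KNOWN_VERSIONS = {"1.0", "1.1", "1.2"}

# Hypothetical marker for "this config describes a UMA backbone"; the real
# module's detection rule is not shown in this PR description.
UMA_BACKBONE_NAMES = {"escnmd_moe_backbone"}


def classify_uma_version_sketch(model_config: Mapping[str, Any]) -> str:
    """Return "1.0", "1.1", "1.2", "unknown_uma", or "not_uma"."""
    backbone = model_config.get("backbone") or {}
    if backbone.get("model") not in UMA_BACKBONE_NAMES:
        return "not_uma"

    # backbone.model_version takes precedence over model_id, so a 1.1
    # checkpoint that already carries a back-filled model_id stays 1.1.
    model_version = backbone.get("model_version")
    if model_version is not None:
        version = str(model_version).strip()
        return version if version in KNOWN_VERSIONS else "unknown_uma"

    # Fall back to model_id; empty or whitespace-only values count as absent.
    model_id = str(model_config.get("model_id") or "").strip()
    if model_id:
        # Accept bare and size-suffixed IDs, e.g. "UMA-1.1" or "UMA-S-1.2".
        version = model_id.rsplit("-", 1)[-1]
        if model_id.upper().startswith("UMA-") and version in KNOWN_VERSIONS:
            return version
        return "unknown_uma"

    # Neither field is present: this is how the 1.0 generation looks.
    return "1.0"


def apply_fixups_sketch(model_config: MutableMapping[str, Any], path: str) -> str:
    """Mutate model_config in place; safe to call more than once."""
    version = classify_uma_version_sketch(model_config)
    if version == "1.0":
        # Illustrative wording only; the real message is not quoted in the PR.
        raise RuntimeError(
            f"{path}: UMA 1.0 checkpoints are not supported by this version; "
            "install an older release with pip install 'fairchem-core<=2.21.0'."
        )
    existing_id = str(model_config.get("model_id") or "").strip()
    if version == "1.1" and not existing_id:
        logging.warning("%s: back-filling model_id='UMA-1.1'", path)
        model_config["model_id"] = "UMA-1.1"
    return version
```

Because the back-fill only happens when `model_id` is missing or blank, and the classifier never consults `model_id` when `backbone.model_version` is present, a second call on the same config leaves it unchanged.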
The fixup is wired at three I/O boundaries (one would not be enough; see `MLIPPredictUnit.__init__`, which does its own `torch.load` and reads the config before delegating to `load_inference_model`):

- `fairchem.core.units.mlip_unit.utils.load_inference_model` (safety net)
- `fairchem.core.units.mlip_unit.predict.MLIPPredictUnit.__init__`: immediately after `torch.load`, before `maybe_update_settings_backend`. UMA 1.0 raises before the ~1 GB tensor allocation.
- `fairchem.core.units.mlip_unit.mlip_unit.convert_train_checkpoint_to_inference_checkpoint`: between `torch.load` and `torch.save`, so finetune-derived inference checkpoints get tagged on disk.

Files changed
- `src/fairchem/core/models/uma/compat.py`: `get_uma_version()` (public), `apply_uma_compat_fixups()` (public). Module docstring covers call sites, override-bypass policy, idempotency, and known DCP-resume / `load_tasks` narrow gaps.
- `tests/core/models/uma/test_compat.py`: 22 unit tests (no GPU, no network).
- `src/fairchem/core/units/mlip_unit/{utils.py,predict.py,mlip_unit.py}`: wire the fixup at the three I/O sites.
- `tests/core/units/mlip_unit/test_predict.py`: 3 integration tests: `test_uma_1p1_predict_unit_has_model_id`, `test_uma_1p1_finetune_propagates_model_id`, `test_uma_1p0_predict_unit_raises` (skip-conditional on `UMA_1P0_PATH` env var or `~/.cache/fairchem` cache).
- `docs/core/uma_changelog.md`: compatibility section.
- `docs/uma_tutorials/uma_tutorial.ipynb`: retarget 9 cells from `"uma-s-1"` → `"uma-s-1p2"`.
- `src/fairchem/applications/cattsunami/DATASET.md`: same retarget.

Test plan
- `pytest tests/core/models/uma/test_compat.py`: 22/22 pass (no GPU, no network).
- `pytest tests/core/units/mlip_unit/test_predict.py::test_uma_1p1_predict_unit_has_model_id tests/core/units/mlip_unit/test_predict.py::test_uma_1p1_finetune_propagates_model_id tests/core/units/mlip_unit/test_predict.py::test_uma_1p0_predict_unit_raises`: 3/3 pass.
- Loading `uma-s-1.pt`, `uma-s-1p1.pt`, `uma-m-1p1.pt`, `uma-s-1p2.pt`: 1.0 raises with full message, 1.1 small + medium back-fill to `"UMA-1.1"`, 1.2 unchanged.
- `initialize_finetuning_model(uma-s-1p1.pt)` → `model.finetune_model_full_config["model_id"] == "UMA-1.1"` (and same for medium).
- Non-UMA checkpoint: classified as `not_uma`, zero mutations, zero logs.

Notes / out of scope
`MLIPTrainEvalUnit._execute_load_state` (DCP train-resume) and `load_tasks` are documented narrow gaps; not instrumented here.