Add train-only runner and latest checkpoint tracking by bkmi · Pull Request #2058 · facebookresearch/fairchem

bkmi · 2026-06-24T22:51:30Z

Summary

Adds a train-only runner primitive and tightens checkpoint handling after the callback refactor.

Changes:

- Introduce TrainRunner, a core runner backed by TorchTNT train().
- Refactor shared train runner mechanics into _BaseTrainRunner so TrainRunner and TrainEvalRunner share:
  - callback hook wiring,
  - checkpoint save/load,
  - preemption resume handling,
  - timeout callback setup.
- Keep TrainEvalRunner as the TorchTNT fit() runner and preserve its existing train_eval_unit interface for compatibility.
- Update TrainCheckpointCallback to:
  - maintain a latest symlink for step and final checkpoints,
  - rotate saved checkpoint directories without deleting latest,
  - validate max_saved_checkpoints >= 1,
  - remove unused stored load_callback state.
- Add W&B tag passthrough to WandBSingletonLogger.init_wandb.
- Add focused tests for train-only runner behavior and checkpoint rotation/latest handling.

Why

Some workflows need train-only execution without eval dataloaders or validation/test loss computation. This adds that capability as a fairchem core primitive instead of keeping it model-local, while preserving the existing train/eval runner behavior.
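
For readers unfamiliar with the split, here is a minimal sketch of the idea: a shared base owns the callback wiring, and each subclass only decides which loop to hand control to. This is not the code in this PR. The class names (BaseRunnerSketch, TrainOnlyRunnerSketch, TrainEvalRunnerSketch) and the injected train_fn / fit_fn parameters are hypothetical; the loops are injected so the sketch runs without TorchTNT, whereas the real runners would call TorchTNT train() and fit().

```python
from typing import Any, Callable, Iterable, List, Optional


class BaseRunnerSketch:
    """Hypothetical shared base: owns the wiring common to both runners."""

    def __init__(
        self,
        unit: Any,
        train_dataloader: Iterable,
        callbacks: Optional[List[Any]] = None,
        checkpoint_callback: Optional[Any] = None,
    ) -> None:
        self.unit = unit
        self.train_dataloader = train_dataloader
        # Shared wiring: user callbacks first, checkpoint callback last.
        self.callbacks = list(callbacks or [])
        if checkpoint_callback is not None:
            self.callbacks.append(checkpoint_callback)

    def run(self) -> None:
        raise NotImplementedError


class TrainOnlyRunnerSketch(BaseRunnerSketch):
    """Train-only path: no eval dataloader is required or touched."""

    def __init__(self, *args: Any, train_fn: Callable[..., None], **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.train_fn = train_fn  # assumption: stands in for TorchTNT train()

    def run(self) -> None:
        self.train_fn(self.unit, self.train_dataloader, callbacks=self.callbacks)


class TrainEvalRunnerSketch(BaseRunnerSketch):
    """Train + eval path: additionally needs an eval dataloader."""

    def __init__(
        self,
        *args: Any,
        eval_dataloader: Iterable,
        fit_fn: Callable[..., None],
        **kwargs: Any,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.eval_dataloader = eval_dataloader
        self.fit_fn = fit_fn  # assumption: stands in for TorchTNT fit()

    def run(self) -> None:
        self.fit_fn(
            self.unit,
            self.train_dataloader,
            self.eval_dataloader,
            callbacks=self.callbacks,
        )
```

The point of the sketch is only that the train-only runner never needs an eval dataloader to be constructed, while both runners see the same callback list.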

The latest-checkpoint symlink also gives downstream jobs and operators a stable checkpoint path while still retaining bounded checkpoint history.
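
A rough sketch of how a latest symlink and bounded rotation can fit together, using only the standard library. This is illustrative and not the TrainCheckpointCallback implementation: the function name register_checkpoint, the step_<N> directory naming, and the temporary link name are all assumptions, and the symlink swap assumes a POSIX filesystem.

```python
import os
import shutil
from pathlib import Path

LATEST_NAME = "latest"  # assumed link name, taken from the PR description


def register_checkpoint(root: Path, ckpt_dir: Path, max_saved_checkpoints: int) -> None:
    """Point `latest` at ckpt_dir, then prune the oldest step directories.

    Hypothetical helper; assumes checkpoints live in `root/step_<N>`.
    """
    if max_saved_checkpoints < 1:
        # With zero retained checkpoints, `latest` would always dangle.
        raise ValueError("max_saved_checkpoints must be >= 1")

    # Build the new link under a temporary name, then rename it over the
    # old one so readers never observe a missing `latest`.
    tmp_link = root / ".latest.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(ckpt_dir.name, target_is_directory=True)
    os.replace(tmp_link, root / LATEST_NAME)

    # Rotate real checkpoint directories only; the symlink is skipped.
    saved = sorted(
        (
            p
            for p in root.iterdir()
            if p.is_dir() and not p.is_symlink() and p.name.startswith("step_")
        ),
        key=lambda p: int(p.name.split("_")[1]),
    )
    for old in saved[:-max_saved_checkpoints]:
        shutil.rmtree(old)
```

Because the newest directory is always among the retained ones when max_saved_checkpoints >= 1, rotation in this sketch never removes the target of latest. A downstream job can then read from root / "latest" without knowing the step number.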

Testing

Ran locally:

pytest tests/core/units/mlip_unit/test_train_eval_runner.py -q --rootdir .
pre-commit run --files \
  src/fairchem/core/components/train/train_runner.py \
  src/fairchem/core/components/train/callbacks.py \
  tests/core/units/mlip_unit/test_train_eval_runner.py
python -m ruff check \
  src/fairchem/core/components/train/train_runner.py \
  src/fairchem/core/components/train/callbacks.py \
  tests/core/units/mlip_unit/test_train_eval_runner.py \
  --select F401,F841,F821,ARG

Introduce a TrainRunner core primitive backed by TorchTNT train(), while sharing checkpoint and callback wiring with TrainEvalRunner. Update TrainCheckpointCallback to maintain a latest symlink, validate checkpoint retention, and cover train-only runner/checkpoint behavior in tests.

meta-cla Bot added the cla signed label Jun 24, 2026

bkmi force-pushed the bkmi/runners-post-callbacks branch from 44a8f90 to 8e78510 Compare June 24, 2026 22:59

bkmi added the enhancement and patch labels Jun 24, 2026

bkmi wants to merge 1 commit into callbacks from bkmi/runners-post-callbacks
