Add train-only runner and latest checkpoint tracking #2058

Open: bkmi wants to merge 1 commit into callbacks from bkmi/runners-post-callbacks

bkmi (Contributor) commented Jun 24, 2026

Summary

Adds a train-only runner primitive and tightens checkpoint handling after the callback refactor.

Changes:

  • Introduce TrainRunner, a core runner backed by TorchTNT train().
  • Refactor shared train runner mechanics into _BaseTrainRunner so TrainRunner and TrainEvalRunner share:
    • callback hook wiring,
    • checkpoint save/load,
    • preemption resume handling,
    • timeout callback setup.
  • Keep TrainEvalRunner as the TorchTNT fit() runner and preserve its existing train_eval_unit interface for compatibility.
  • Update TrainCheckpointCallback to:
    • maintain a latest symlink for step and final checkpoints,
    • rotate saved checkpoint directories without deleting latest,
    • validate max_saved_checkpoints >= 1,
    • remove unused stored load_callback state.
  • Add W&B tag passthrough to WandBSingletonLogger.init_wandb.
  • Add focused tests for train-only runner behavior and checkpoint rotation/latest handling.
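The split between the shared base and the two runners can be pictured roughly as a template-method pattern. The sketch below is only an illustration under stated assumptions: the class names ending in `Sketch`, the `run`/`_run_loop` methods, and the recorded event strings are hypothetical and are not taken from the fairchem source, and the TorchTNT `train()` and `fit()` calls are replaced by placeholders so the example runs without any dependencies.

```python
from abc import ABC, abstractmethod


class _BaseTrainRunnerSketch(ABC):
    """Hypothetical stand-in for the shared base; not the fairchem class."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])
        self.events = []  # records the order of operations for inspection

    def run(self):
        # Shared mechanics live in the base so both runners inherit them.
        self.events.append("load_checkpoint_if_resuming")
        for callback in self.callbacks:
            callback(self)  # placeholder for callback hook wiring
        self._run_loop()
        self.events.append("save_final_checkpoint")
        return self.events

    @abstractmethod
    def _run_loop(self):
        """Subclasses choose the entry point: train-only or train plus eval."""


class TrainRunnerSketch(_BaseTrainRunnerSketch):
    def _run_loop(self):
        self.events.append("train")  # stands in for TorchTNT train()


class TrainEvalRunnerSketch(_BaseTrainRunnerSketch):
    def _run_loop(self):
        self.events.append("fit")  # stands in for TorchTNT fit()
```

With this shape, the only difference between the two runners is the loop they dispatch to; checkpoint, preemption, and callback handling would stay in one place.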

Why

Some workflows need train-only execution without eval dataloaders or validation/test loss computation. This adds that capability as a fairchem core primitive instead of keeping it model-local, while preserving the existing train/eval runner behavior.

The latest-checkpoint symlink also gives downstream jobs and operators a stable checkpoint path while still retaining bounded checkpoint history.
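One way such a scheme could work is sketched below. This is not the `TrainCheckpointCallback` implementation: the helper names, the `step_<N>` directory naming, and the temporary link name are assumptions made for illustration, and the atomic swap via `os.replace` assumes a POSIX filesystem.

```python
import os
import shutil
import tempfile
from pathlib import Path


def update_latest_symlink(checkpoint_root: Path, target: Path, name: str = "latest") -> None:
    """Repoint checkpoint_root/name at target, a directory inside checkpoint_root."""
    link = checkpoint_root / name
    tmp = checkpoint_root / f".{name}.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear a leftover temporary link from an interrupted run
    # A relative target keeps the link valid if the whole root is moved.
    os.symlink(target.name, tmp, target_is_directory=True)
    os.replace(tmp, link)  # rename over the old link in one step


def rotate_checkpoints(checkpoint_root: Path, max_saved_checkpoints: int) -> list:
    """Delete the oldest step_<N> directories, keeping the newest ones."""
    if max_saved_checkpoints < 1:
        raise ValueError("max_saved_checkpoints must be >= 1")
    step_dirs = sorted(
        (
            p
            for p in checkpoint_root.iterdir()
            # Skip symlinks so the latest link itself is never a deletion candidate.
            if p.is_dir() and not p.is_symlink() and p.name.startswith("step_")
        ),
        key=lambda p: int(p.name.split("_")[1]),
    )
    removed = step_dirs[:-max_saved_checkpoints]
    for old in removed:
        shutil.rmtree(old)
    return removed
```

A downstream job could then read from `checkpoint_root / "latest"` without knowing the step number, while rotation bounds how many step directories remain on disk.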

Testing

Ran locally:

pytest tests/core/units/mlip_unit/test_train_eval_runner.py -q --rootdir .
pre-commit run --files \
  src/fairchem/core/components/train/train_runner.py \
  src/fairchem/core/components/train/callbacks.py \
  tests/core/units/mlip_unit/test_train_eval_runner.py
python -m ruff check \
  src/fairchem/core/components/train/train_runner.py \
  src/fairchem/core/components/train/callbacks.py \
  tests/core/units/mlip_unit/test_train_eval_runner.py \
  --select F401,F841,F821,ARG

meta-cla bot added the cla signed label Jun 24, 2026

Introduce a TrainRunner core primitive backed by TorchTNT train(), while
sharing checkpoint and callback wiring with TrainEvalRunner.

Update TrainCheckpointCallback to maintain a latest symlink, validate
checkpoint retention, and cover train-only runner/checkpoint behavior in tests.
bkmi force-pushed the bkmi/runners-post-callbacks branch from 44a8f90 to 8e78510 June 24, 2026 22:59
bkmi added the enhancement and patch labels Jun 24, 2026
Labels

cla signed, enhancement, patch
