Adding training functionalities to Toolkit by laserkelvin · Pull Request #108 · NVIDIA/nvalchemi-toolkit

laserkelvin · 2026-06-09T04:33:14Z

ALCHEMI Toolkit Pull Request

Description

This PR introduces the core functionalities required to support training and fine-tuning of models in nvalchemi-toolkit.

This PR is still a WIP - do not merge!

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Performance improvement
Documentation update
Refactoring (no functional changes)
CI/CD or infrastructure change

Related Issues

Changes Made

Adds create_model_spec methods and dynamic pydantic model creation for pickle-less serialization of configuration
Adds a few base loss functions and the general loss abstraction, including individual losses and a composed loss function. The composed loss supports weight scheduling, so the relative weighting of different losses can be adjusted over the course of training
Adds a TrainingStrategy pydantic model that validates the training recipe and executes the loop. The execution is highly modular and extensible, allowing for (hopefully) arbitrarily complex training workflows to be built, not limited to MLIPs
Adds a FineTuningStrategy that specializes TrainingStrategy for...fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow
Adds data loading optimizations; the main change is the addition of "batched" pre-fetching, which amortizes I/O for non-contiguous data samples. This is crucial for Zarr performance when shuffling data
Adds multidataset support, with a "meta" sampler that allows users to implement different cross-dataset sampling strategies (e.g. to account for dataset size imbalances)
Adds several training-related hooks, such as model averaging, mixed precision, and checkpointing
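
To illustrate the weight-scheduling idea behind the composed loss, here is a minimal, dependency-free sketch. All names (`ComposedLoss`, `linear_ramp`, `WeightSchedule`) are hypothetical and chosen for illustration only; they are not the toolkit's actual API, and plain floats stand in for loss tensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical names throughout; this is not the toolkit's actual API.
WeightSchedule = Callable[[int], float]  # maps a training step to a weight


def linear_ramp(start: float, end: float, num_steps: int) -> WeightSchedule:
    """Linearly interpolate a weight from `start` to `end` over `num_steps` (> 0)."""

    def schedule(step: int) -> float:
        # Clamp progress to [0, 1] so the weight stays at `end` after the ramp.
        frac = min(max(step / num_steps, 0.0), 1.0)
        return start + frac * (end - start)

    return schedule


@dataclass
class ComposedLoss:
    """Weighted sum of named loss terms, one weight schedule per term."""

    schedules: Dict[str, WeightSchedule]

    def __call__(self, terms: Dict[str, float], step: int) -> float:
        # Each term is scaled by its schedule evaluated at the current step.
        return sum(self.schedules[name](step) * value for name, value in terms.items())


loss_fn = ComposedLoss(
    schedules={
        "energy": lambda step: 1.0,  # constant weight
        "forces": linear_ramp(0.0, 10.0, num_steps=100),  # ramps up over training
    }
)
terms = {"energy": 2.0, "forces": 0.5}
early = loss_fn(terms, step=0)   # forces weight is 0.0 here
late = loss_fn(terms, step=100)  # forces weight has reached 10.0
```

In this sketch the force term contributes nothing at step 0 and dominates by step 100, which is the kind of relative re-weighting described above. How the actual implementation represents schedules is not specified in this description.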

Testing

Unit tests pass locally (make pytest)
Linting passes (make lint)
New tests added for new functionality meet coverage expectations?

Checklist

I have read and understand the Contributing Guidelines
I have updated the CHANGELOG.md
I have performed a self-review of my code
I have added docstrings to new functions/classes
I have updated the documentation (if applicable)

Additional Notes

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

…to-wrap and claim validation

- Add field_validator and register_hook override that auto-fold bare TrainingUpdateHook instances into a TrainingUpdateOrchestrator and merge into an existing one when present.
- Enforce single-claimant rule for DO_BACKWARD and DO_OPTIMIZER_STEP via _validate_single_do_claimants, catching both natural and explicit-stage claims with identity-based candidate detection.
- Cache orchestrator presence and DO-stage claim flags so the hot path in _train_one_batch skips default zero_grad/backward/step only when something owns those operations.
- Relocate _hook_claims_stage and _fold_training_update_hooks from strategy.py to training/hooks/update.py, aligning the predicate with the central HookRegistryMixin._call_hooks dispatch shape.
- Add a breadcrumb at BEFORE_BACKWARD pointing readers at TrainingUpdateOrchestrator for the loss-chain contract.
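
The single-claimant rule and the cached claim flags described in this commit can be sketched roughly as follows. This is an illustration under assumptions: `Stage`, `Hook`, `validate_single_claimants`, and `claimed_stages` are hypothetical stand-ins, and "identity-based" is interpreted here as deduplicating hook objects with `id()` so that one hook registered twice counts once.

```python
from enum import Enum, auto


class Stage(Enum):
    # Hypothetical stage enum; only the names come from the commit message.
    BEFORE_BACKWARD = auto()
    DO_BACKWARD = auto()
    DO_OPTIMIZER_STEP = auto()


class Hook:
    """Hypothetical hook that declares which stages it claims."""

    def __init__(self, name, claims=()):
        self.name = name
        self.claims = frozenset(claims)


EXCLUSIVE_STAGES = (Stage.DO_BACKWARD, Stage.DO_OPTIMIZER_STEP)


def validate_single_claimants(hooks):
    """Raise if more than one distinct hook object claims an exclusive stage."""
    for stage in EXCLUSIVE_STAGES:
        # Deduplicate by object identity, not by equality or name.
        claimants = {id(h): h for h in hooks if stage in h.claims}
        if len(claimants) > 1:
            names = sorted(h.name for h in claimants.values())
            raise ValueError(f"{stage.name} claimed by multiple hooks: {names}")


def claimed_stages(hooks):
    """Compute once and cache: which exclusive stages are owned by some hook."""
    return {stage: any(stage in h.claims for h in hooks) for stage in EXCLUSIVE_STAGES}


amp = Hook("amp", claims=[Stage.DO_BACKWARD])
logger = Hook("logger", claims=[Stage.BEFORE_BACKWARD])

validate_single_claimants([amp, logger, amp])  # same object twice: accepted
flags = claimed_stages([amp, logger])
# A training loop could skip its default backward() only when
# flags[Stage.DO_BACKWARD] is True, and keep its default optimizer step otherwise.
```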

…ator integration

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Brings in 5 upstream commits from main:

- a85db34 Refactor hook contexts (NVIDIA#93) - splits HookContext into base + DynamicsContext + TrainContext
- 84d8119 chore: bumping torch minimum version to 2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat thermostat coupling (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress convention (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

This propagates the new TrainContext shape to all stacked PRs in the training-epic series (#4, #5, #6, #7, #8, #9). Stacked PR branches will need to be rebased or merged on top of this updated training-epic.

Brings training-epic up to date with origin/main on this branch:

- a85db34 Refactor hook contexts (NVIDIA#93) - HookContext/DynamicsContext/TrainContext split
- 84d8119 chore: torch>=2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

Replaces an earlier direct origin/main merge to ensure a single merge base between this branch and training-epic, so the PR diff displays the true contribution scope (training primitives only, not the upstream merge churn).

# Conflicts:
#   nvalchemi/hooks/_context.py
#   test/hooks/test_context.py

Bring in PR #4 (training runtime primitives) and PR NVIDIA#93 (hook context refactor).

Conflict resolution:

- nvalchemi/hooks/_context.py: take upstream's split (HookContext base + DynamicsContext / TrainContext subclasses); keep our additions on TrainContext only:
  * grad_scaler: torch.amp.GradScaler | None = None
  * optimizers / lr_schedulers default to empty list (field(default_factory=list)) instead of None, so the orchestrator's gated-op consumers can iterate without None guards.
- test/hooks/test_context.py: take upstream verbatim, flip optimizers/lr_schedulers default assertions to == [], cover grad_scaler default + populated cases, and add test_optimizers_default_is_independent_per_instance to guard against shared-list aliasing.

Strategy + orchestrator wiring:

- TrainingStrategy._build_context now returns TrainContext and passes model=self.models["main"] to preserve the legacy ctx.model alias for hooks that read a single main model (upstream PR NVIDIA#93 decoupled model from models, so we re-establish the alias at the producer rather than via a property).
- TrainingUpdateHook / TrainingUpdateOrchestrator type hints narrowed from HookContext to TrainContext (no runtime change; TrainContext IS-A HookContext).

Verification: 146 targeted tests / 462 training / 1071 hooks+dynamics passing; make lint + make interrogate green.
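
The shared-list aliasing that test_optimizers_default_is_independent_per_instance guards against can be shown with a small standalone sketch. `TrainContextSketch` is a hypothetical stand-in with only the two list fields; it is not the real TrainContext, and a string is used in place of a real optimizer.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class TrainContextSketch:
    """Hypothetical stand-in for the context; only the two list fields matter here."""

    # default_factory builds a fresh list for every instance, so instances
    # never share (alias) the same default list object.
    optimizers: List[Any] = field(default_factory=list)
    lr_schedulers: List[Any] = field(default_factory=list)


a = TrainContextSketch()
b = TrainContextSketch()
a.optimizers.append("sgd")  # placeholder object instead of a real optimizer

# Consumers can iterate without a None guard, even when nothing is configured.
visited = [opt for opt in b.optimizers]
```

Appending to `a.optimizers` leaves `b.optimizers` empty because each instance owns its own list. Note that `dataclasses` rejects a bare `= []` default with a `ValueError` at class definition time, which is why `default_factory` is the idiomatic choice here.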

laserkelvin and others added 30 commits May 13, 2026 13:34

Align CUDA dependency variants

58f00bc

Document uv sync CUDA setup

699d66f

test(training/hooks): cover TrainingUpdateHook framework and orchestr…

b5f2ef3

…ator integration

Preserve CUDA variant for uv run

2891f3b

fix(training): harden serialization primitives

0d347bd

docs: clarifying docstring for model in hook context

b14cce1

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

feat(training): add TrainingStrategy orchestration

dbf837f

Align MACE CUDA extras

c3d6eec

Compose MACE with CUDA extras

6fa8210

chore: excluding darwin on sys_platform

cf089a6

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Pin CI sync to CUDA 13

927e071

Clarify CUDA install index

f55f5d1

docs: removing cu13 specification for io test

f778516

docs: clarifying bind

74c2fdd

Merge main dependency floor

273ee9f

chore: removing explicit torch pins

0890dcb

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

docs: aligning cu specification in README

fdf2c90

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

docs: catching remaining cu130 mentions

acd1e34

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Merge pull request #4 from laserkelvin/feat-training-runtime-primitives

15dcd2c

Add supporting functions for upcoming `TrainingStrategy`

Merge remote-tracking branch 'fork/training-epic' into feat-training-…

efaf180

…strategy-orchestration Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Address training strategy review feedback

dda8374

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Harden restored model specs

ea53486

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Preserve composed loss weights in specs

04f03b9

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Preserve training model call mode in specs

85b2f63

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Support ModuleDict in optimizer setup

ba81a7c

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Reject empty optimizer configs

a091c31

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin and others added 30 commits June 11, 2026 14:41

Fix EMA tensor device restoration

bf95db4

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Add EMA device restoration coverage

c9ee6cf

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Test MACE EMA checkpoint roundtrip

0a6e675

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Route torchvision through CUDA indexes

fcb39c4

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Remove torchvision fake op patch from MACE tests

115a16f

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Use strategy checkpoint path for MACE EMA test

851d5a2

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Support callable model specs for MACE checkpoints

56b0c29

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Clarify MACE EMA strategy checkpoint test

d9043f2

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Document EMA checkpoint reconstruction fix

d5f0946

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Merge remote-tracking branch 'fork/training-epic' into cueq-mace-fix

cd38cb0

# Conflicts: # CHANGELOG.md

Publish restored EMA before validation

c9fa621

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Initialize EMA during training setup

5dd9d76

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

match tensor dtype in ema

0f51499

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

Clarify EMA stage dispatch

210a4a7

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Merge pull request #26 from ys-teh/fix/ema_dype

7454cb4

match tensor dtype in ema

Merge pull request #25 from laserkelvin/cueq-mace-fix

14ce0cc

Support callable model specs for EMA checkpoint roundtrips

update pipeline to be compatible with ema

d077e7f

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…

c9349ab

…support Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com> # Conflicts: # nvalchemi/data/datapipes/dataset.py

Merge pull request #17 from laserkelvin/multi-dataset-support

1a50571

Add PhysicsNeMo-compatible multidataset datapipes

Merge pull request #23 from laserkelvin/feat-physicsnemo-profiler-hook

b9254ba

Add shared profiling hooks for training and dynamics

Merge fork/training-epic into feat-reporting-abstraction

b9855ee

fix(hooks): align reporting loss component keys

e032012

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

Merge pull request #16 from laserkelvin/feat-reporting-abstraction

c0b8907

Adding workflow reporter abstraction

Merge pull request #28 from ys-teh/fix/pipeline-compatiblity-w-ema

0e3b818

update pipeline to be compatible with ema

fix bug on unweighted loss reporting

6a9a7da

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

add fix

17cd3b5

Add ema build override site

e677a7e

Update change log

3d61b75

Merge pull request #29 from ys-teh/fix/unweighted-loss-reporting

950614b

Fix unweighted validation loss reporting

Merge pull request #30 from EricZQu/ngnp-integration-feedback

987f778

Add Fix from NGNP Integration

laserkelvin wants to merge 312 commits into NVIDIA:main from laserkelvin:training-epic

4 participants
