Adding training functionalities to Toolkit #108

Draft
laserkelvin wants to merge 312 commits into NVIDIA:main from laserkelvin:training-epic

Conversation

@laserkelvin

Collaborator

ALCHEMI Toolkit Pull Request

Description

This PR introduces the core functionalities required to support training and fine-tuning of models in nvalchemi-toolkit.

This PR is still a WIP - do not merge!

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Changes Made

  • Adds create_model_spec methods and dynamic pydantic model creation for pickle-free serialization of configuration
  • Adds a general loss abstraction with a few base loss functions: individual losses and a composed loss function. The composed loss supports weight scheduling, so the relative weighting of the individual losses can change over the course of training
  • Adds a TrainingStrategy pydantic model that validates the training recipe and executes the training loop. Execution is modular and extensible, so that (hopefully) arbitrarily complex training workflows can be built, not only those for MLIPs
  • Adds a FineTuningStrategy that specializes TrainingStrategy for fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow
  • Adds data loading optimizations; the main change is the addition of "batched" pre-fetching, which amortizes I/O for non-contiguous data samples. This is crucial for Zarr performance when shuffling data
  • Adds multidataset support, with a "meta" sampler that allows users to implement different cross-dataset sampling strategies (e.g. to account for dataset size imbalances)
  • Adds several training-related hooks, such as model averaging, mixed precision, checkpointing
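
As a rough illustration of the composed loss with weight scheduling described above, here is a minimal, dependency-free sketch. The names `LinearWeightSchedule`, `ComposedLoss`, and `squared_error` are hypothetical and are not taken from this PR; plain floats stand in for tensors, and a linear ramp is only one possible schedule.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Hypothetical names for illustration only; not the toolkit's actual API.
LossFn = Callable[[Mapping[str, float], Mapping[str, float]], float]


@dataclass(frozen=True)
class LinearWeightSchedule:
    """Linearly interpolate a loss weight from `start` to `end` over `num_steps`."""

    start: float
    end: float
    num_steps: int

    def __call__(self, step: int) -> float:
        # Clamp so the weight stays at `end` once the schedule has finished.
        frac = min(max(step, 0), self.num_steps) / self.num_steps
        return self.start + frac * (self.end - self.start)


def squared_error(key: str) -> LossFn:
    """Build a loss that compares one named quantity in prediction and target."""

    def loss(pred: Mapping[str, float], target: Mapping[str, float]) -> float:
        return (pred[key] - target[key]) ** 2

    return loss


class ComposedLoss:
    """Weighted sum of individual losses, with step-dependent weights."""

    def __init__(self, terms):
        # terms maps a name to a (loss function, weight schedule) pair.
        self.terms = dict(terms)

    def __call__(self, pred, target, step: int) -> float:
        total = 0.0
        for loss_fn, schedule in self.terms.values():
            total += schedule(step) * loss_fn(pred, target)
        return total


# Shift emphasis from the energy term to the force term over 100 steps.
composed = ComposedLoss(
    {
        "energy": (squared_error("energy"), LinearWeightSchedule(1.0, 0.0, 100)),
        "forces": (squared_error("forces"), LinearWeightSchedule(0.0, 1.0, 100)),
    }
)
pred = {"energy": 2.0, "forces": 1.0}
target = {"energy": 0.0, "forces": 0.0}
```

Calling `composed(pred, target, step)` at increasing `step` values moves the total from the energy-dominated value toward the force-dominated one. The actual implementation in this PR may structure schedules and loss terms differently.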

Testing

  • Unit tests pass locally (make pytest)
  • Linting passes (make lint)
  • New tests added for new functionality meet coverage expectations?

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor are they an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

laserkelvin and others added 30 commits May 13, 2026 13:34
…to-wrap and claim validation

- Add field_validator and register_hook override that auto-fold bare
  TrainingUpdateHook instances into a TrainingUpdateOrchestrator and
  merge into an existing one when present.
- Enforce single-claimant rule for DO_BACKWARD and DO_OPTIMIZER_STEP
  via _validate_single_do_claimants, catching both natural and
  explicit-stage claims with identity-based candidate detection.
- Cache orchestrator presence and DO-stage claim flags so the hot
  path in _train_one_batch skips default zero_grad/backward/step
  only when something owns those operations.
- Relocate _hook_claims_stage and _fold_training_update_hooks from
  strategy.py to training/hooks/update.py, aligning the predicate
  with the central HookRegistryMixin._call_hooks dispatch shape.
- Add a breadcrumb at BEFORE_BACKWARD pointing readers at
  TrainingUpdateOrchestrator for the loss-chain contract.
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
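
The single-claimant rule from the commit message above can be sketched roughly as follows. This is an assumption-laden illustration: the stage names come from the commit message, but the `Stage` enum, the `Hook` class, its `stages` attribute, and `validate_single_do_claimants` are hypothetical stand-ins, not the code in `training/hooks/update.py`.

```python
from enum import Enum, auto


class Stage(Enum):
    # Stage names are taken from the commit message; the enum itself is a stand-in.
    BEFORE_BACKWARD = auto()
    DO_BACKWARD = auto()
    DO_OPTIMIZER_STEP = auto()


# Stages that at most one hook may own.
EXCLUSIVE_STAGES = (Stage.DO_BACKWARD, Stage.DO_OPTIMIZER_STEP)


class Hook:
    """Hypothetical hook that declares which stages it claims."""

    def __init__(self, name, stages):
        self.name = name
        self.stages = frozenset(stages)


def validate_single_do_claimants(hooks):
    """Return {stage: claimant or None}; raise if a DO stage has two claimants."""
    owners = {}
    for stage in EXCLUSIVE_STAGES:
        # Deduplicate by identity so one instance registered twice counts once.
        claimants = {id(h): h for h in hooks if stage in h.stages}
        if len(claimants) > 1:
            names = sorted(h.name for h in claimants.values())
            raise ValueError(f"{stage.name} claimed by multiple hooks: {names}")
        owners[stage] = next(iter(claimants.values()), None)
    return owners


amp = Hook("amp", [Stage.DO_BACKWARD])
log = Hook("log", [Stage.BEFORE_BACKWARD])
owners = validate_single_do_claimants([amp, log, amp])
```

Computing `owners` once at registration time is one way to obtain the cached claim flags the commit mentions: the per-batch path can check whether `owners[Stage.DO_BACKWARD]` is `None` and run the default backward pass only in that case.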
Brings in 5 upstream commits from main:
- a85db34 Refactor hook contexts (NVIDIA#93) - splits HookContext into base +
  DynamicsContext + TrainContext
- 84d8119 chore: bumping torch minimum version to 2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat thermostat coupling (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress convention (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

This propagates the new TrainContext shape to all stacked PRs in the
training-epic series (#4, #5, #6, #7, #8, #9). Stacked PR branches will
need to be rebased or merged on top of this updated training-epic.
Brings training-epic up to date with origin/main on this branch:
- a85db34 Refactor hook contexts (NVIDIA#93) - HookContext/DynamicsContext/TrainContext split
- 84d8119 chore: torch>=2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

Replaces an earlier direct origin/main merge to ensure a single merge base
between this branch and training-epic, so the PR diff displays the true
contribution scope (training primitives only, not the upstream merge churn).

# Conflicts:
#	nvalchemi/hooks/_context.py
#	test/hooks/test_context.py
Add supporting functions for upcoming `TrainingStrategy`
Bring in PR #4 (training runtime primitives) and PR NVIDIA#93 (hook context refactor).

Conflict resolution:

- nvalchemi/hooks/_context.py: take upstream's split (HookContext base + DynamicsContext / TrainContext subclasses); keep our additions on TrainContext only:
  * grad_scaler: torch.amp.GradScaler | None = None
  * optimizers / lr_schedulers default to empty list (field(default_factory=list)) instead of None, so the orchestrator's gated-op consumers can iterate without None guards.
- test/hooks/test_context.py: take upstream verbatim, flip optimizers/lr_schedulers default assertions to == [], cover grad_scaler default + populated cases, and add test_optimizers_default_is_independent_per_instance to guard against shared-list aliasing.

Strategy + orchestrator wiring:
- TrainingStrategy._build_context now returns TrainContext and passes model=self.models["main"] to preserve the legacy ctx.model alias for hooks that read a single main model (upstream PR NVIDIA#93 decoupled model from models, so we re-establish the alias at the producer rather than via a property).
- TrainingUpdateHook / TrainingUpdateOrchestrator type hints narrowed from HookContext to TrainContext (no runtime change; TrainContext IS-A HookContext).

Verification: 146 targeted tests / 462 training / 1071 hooks+dynamics passing; make lint + make interrogate green.
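
The `field(default_factory=list)` default and the shared-list aliasing guard mentioned in the conflict resolution above can be sketched as below. `TrainContextSketch` is a hypothetical stand-in that assumes a standard-library dataclass and carries only the three fields named in the commit message; `grad_scaler` is typed as `Any` here to avoid importing torch.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrainContextSketch:
    """Stand-in for TrainContext with only the fields discussed above."""

    # The commit types this as torch.amp.GradScaler | None; Any avoids the import.
    grad_scaler: Any = None
    # default_factory builds a fresh list per instance, so no list is shared.
    optimizers: list = field(default_factory=list)
    lr_schedulers: list = field(default_factory=list)


a = TrainContextSketch()
b = TrainContextSketch()
a.optimizers.append("sgd")  # placeholder object standing in for an optimizer

# Consumers can iterate without a None guard; an empty list simply does nothing.
stepped = [opt for opt in b.optimizers]
```

Appending to `a.optimizers` leaves `b.optimizers` empty, which is the property a test such as test_optimizers_default_is_independent_per_instance would check. A mutable default written as `optimizers: list = []` is rejected by `dataclasses` at class-definition time, which is why the factory form is needed.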
…strategy-orchestration

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
laserkelvin and others added 30 commits June 11, 2026 14:41
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Support callable model specs for EMA checkpoint roundtrips
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
…support

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

# Conflicts:
#	nvalchemi/data/datapipes/dataset.py
Add PhysicsNeMo-compatible multidataset datapipes
Add shared profiling hooks for training and dynamics
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
update pipeline to be compatible with ema
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Fix unweighted validation loss reporting
Labels

enhancement New feature or request

4 participants