Adding training functionalities to Toolkit #108
Draft
laserkelvin wants to merge 312 commits into
…to-wrap and claim validation
- Add field_validator and register_hook override that auto-fold bare TrainingUpdateHook instances into a TrainingUpdateOrchestrator and merge into an existing one when present.
- Enforce single-claimant rule for DO_BACKWARD and DO_OPTIMIZER_STEP via _validate_single_do_claimants, catching both natural and explicit-stage claims with identity-based candidate detection.
- Cache orchestrator presence and DO-stage claim flags so the hot path in _train_one_batch skips default zero_grad/backward/step only when something owns those operations.
- Relocate _hook_claims_stage and _fold_training_update_hooks from strategy.py to training/hooks/update.py, aligning the predicate with the central HookRegistryMixin._call_hooks dispatch shape.
- Add a breadcrumb at BEFORE_BACKWARD pointing readers at TrainingUpdateOrchestrator for the loss-chain contract.
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
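The single-claimant rule and the cached claim flags described in the commit above can be illustrated with a minimal, standalone sketch. Everything here is hypothetical: `Stage`, `Hook`, `validate_single_claimants`, and `train_one_batch` are stand-ins invented for illustration and are not the toolkit's actual classes or signatures. The sketch only assumes what the commit message states, namely that at most one hook may own `DO_BACKWARD` or `DO_OPTIMIZER_STEP`, that candidates are detected by object identity, and that the default operation is skipped only when a stage has an owner.

```python
from enum import Enum, auto


class Stage(Enum):
    """Hypothetical stand-in for the training stage enum."""

    BEFORE_BACKWARD = auto()
    DO_BACKWARD = auto()
    DO_OPTIMIZER_STEP = auto()


class Hook:
    """Hypothetical hook that declares which stages it claims."""

    def __init__(self, name, claims):
        self.name = name
        self.claims = frozenset(claims)


# Stages that at most one hook may own (assumed from the commit message).
EXCLUSIVE_STAGES = (Stage.DO_BACKWARD, Stage.DO_OPTIMIZER_STEP)


def validate_single_claimants(hooks):
    """Return the owner per exclusive stage; raise if two distinct hooks claim one."""
    owners = {}
    for stage in EXCLUSIVE_STAGES:
        seen_ids = set()
        claimants = []
        for hook in hooks:
            # Identity-based detection: the same object registered twice counts once.
            if stage in hook.claims and id(hook) not in seen_ids:
                seen_ids.add(id(hook))
                claimants.append(hook)
        if len(claimants) > 1:
            names = ", ".join(h.name for h in claimants)
            raise ValueError(f"{stage.name} claimed by multiple hooks: {names}")
        owners[stage] = claimants[0] if claimants else None
    return owners


def train_one_batch(owners, log):
    """Record, per exclusive stage, whether the owner or the default path runs."""
    for stage in EXCLUSIVE_STAGES:
        owner = owners[stage]
        log.append(f"{stage.name}:{owner.name if owner else 'default'}")
    return log
```

Computing `owners` once at registration time, rather than scanning the hook list on every batch, is the kind of caching the commit describes for the hot path.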
Brings in 5 upstream commits from main:
- a85db34 Refactor hook contexts (NVIDIA#93) - splits HookContext into base + DynamicsContext + TrainContext
- 84d8119 chore: bumping torch minimum version to 2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat thermostat coupling (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress convention (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

This propagates the new TrainContext shape to all stacked PRs in the training-epic series (#4, #5, #6, #7, #8, #9). Stacked PR branches will need to be rebased or merged on top of this updated training-epic.
Brings training-epic up to date with origin/main on this branch:
- a85db34 Refactor hook contexts (NVIDIA#93) - HookContext/DynamicsContext/TrainContext split
- 84d8119 chore: torch>=2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

Replaces an earlier direct origin/main merge to ensure a single merge base between this branch and training-epic, so the PR diff displays the true contribution scope (training primitives only, not the upstream merge churn).

# Conflicts:
# nvalchemi/hooks/_context.py
# test/hooks/test_context.py
Add supporting functions for upcoming `TrainingStrategy`
Bring in PR #4 (training runtime primitives) and PR NVIDIA#93 (hook context refactor).

Conflict resolution:
- nvalchemi/hooks/_context.py: take upstream's split (HookContext base + DynamicsContext / TrainContext subclasses); keep our additions on TrainContext only:
  * grad_scaler: torch.amp.GradScaler | None = None
  * optimizers / lr_schedulers default to empty list (field(default_factory=list)) instead of None, so the orchestrator's gated-op consumers can iterate without None guards.
- test/hooks/test_context.py: take upstream verbatim, flip optimizers/lr_schedulers default assertions to == [], cover grad_scaler default + populated cases, and add test_optimizers_default_is_independent_per_instance to guard against shared-list aliasing.

Strategy + orchestrator wiring:
- TrainingStrategy._build_context now returns TrainContext and passes model=self.models["main"] to preserve the legacy ctx.model alias for hooks that read a single main model (upstream PR NVIDIA#93 decoupled model from models, so we re-establish the alias at the producer rather than via a property).
- TrainingUpdateHook / TrainingUpdateOrchestrator type hints narrowed from HookContext to TrainContext (no runtime change; TrainContext IS-A HookContext).

Verification: 146 targeted tests / 462 training / 1071 hooks+dynamics passing; make lint + make interrogate green.
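The choice of `field(default_factory=list)` over a `None` default can be illustrated with a small sketch. The classes below are simplified, hypothetical stand-ins assuming the contexts are plain dataclasses; they are not the real definitions in nvalchemi/hooks/_context.py, which carry more fields. `grad_scaler` is typed as `Any` here so the sketch runs without torch, and `step_all` is an invented consumer in which optimizers are plain callables.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HookContext:
    """Hypothetical base context; the real class has more fields."""

    step: int = 0


@dataclass
class TrainContext(HookContext):
    """Hypothetical training context mirroring the defaults described above."""

    model: Any = None
    grad_scaler: Any = None  # stands in for torch.amp.GradScaler | None
    # default_factory builds a fresh list per instance, so no list is shared.
    optimizers: list = field(default_factory=list)
    lr_schedulers: list = field(default_factory=list)


def step_all(ctx: TrainContext) -> int:
    """Call every optimizer stand-in without a None guard; return the count."""
    count = 0
    for opt in ctx.optimizers:
        opt()
        count += 1
    return count
```

With an empty-list default, a consumer such as `step_all` can loop unconditionally, and mutating one context's `optimizers` never leaks into another instance, which is the aliasing hazard the added test guards against.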
…strategy-orchestration Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
# Conflicts: # CHANGELOG.md
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
match tensor dtype in ema
Support callable model specs for EMA checkpoint roundtrips
…support Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com> # Conflicts: # nvalchemi/data/datapipes/dataset.py
Add PhysicsNeMo-compatible multidataset datapipes
Add shared profiling hooks for training and dynamics
Adding workflow reporter abstraction
update pipeline to be compatible with ema
Fix unweighted validation loss reporting
Add Fix from NGNP Integration
ALCHEMI Toolkit Pull Request
Description
This PR introduces the core functionalities required to support training and fine-tuning of models in `nvalchemi-toolkit`. This PR is still a WIP - do not merge!
Type of Change
Related Issues
Changes Made
- `create_model_spec` methods and dynamic pydantic model creation for `pickle`-less serialization of configuration
- `TrainingStrategy` pydantic model as a recipe validation and loop executor. The execution is highly modular and extendible, allowing for (hopefully) arbitrarily complex training workflows to be built, and not limited to MLIPs
- `FineTuningStrategy` that specializes `TrainingStrategy` for...fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow
Testing
- (`make pytest`)
- (`make lint`)
Checklist
Additional Notes
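As a rough illustration of the `pickle`-less configuration idea listed under Changes Made, the sketch below stores a constructor's import path and keyword arguments as JSON and rebuilds the object from that text. This is an assumption-laden stand-in, not the PR's implementation: the PR describes `create_model_spec` methods built on dynamic pydantic models, whereas `make_spec` and `build_from_spec` here are hypothetical helpers that use only the standard library.

```python
import importlib
import json
from datetime import timedelta


def make_spec(cls, **kwargs):
    """Describe how to build an object as JSON-friendly data (no pickle)."""
    return {"target": f"{cls.__module__}.{cls.__qualname__}", "kwargs": kwargs}


def build_from_spec(spec):
    """Re-import the target class by dotted path and call it with the stored kwargs."""
    # Limitation of this sketch: only top-level classes resolve, not nested ones.
    module_name, _, class_name = spec["target"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec["kwargs"])


# Round trip through plain JSON text, using a stdlib class as the example target.
spec = make_spec(timedelta, hours=1, minutes=30)
restored = build_from_spec(json.loads(json.dumps(spec)))
```

Because the stored form is plain text naming a class and its arguments, it can be inspected, diffed, and validated before anything is constructed, which is not possible with an opaque pickle payload.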
Tip
This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.