Draft
312 commits
68bce43
feat(training/strategy): integrate TrainingUpdateOrchestrator with au…
laserkelvin May 13, 2026
58f00bc
Align CUDA dependency variants
laserkelvin May 13, 2026
699d66f
Document uv sync CUDA setup
laserkelvin May 13, 2026
b5f2ef3
test(training/hooks): cover TrainingUpdateHook framework and orchestr…
laserkelvin May 13, 2026
2891f3b
Preserve CUDA variant for uv run
laserkelvin May 13, 2026
0d347bd
fix(training): harden serialization primitives
laserkelvin May 13, 2026
b14cce1
docs: clarifying docstring for model in hook context
laserkelvin May 13, 2026
dbf837f
feat(training): add TrainingStrategy orchestration
laserkelvin May 7, 2026
c3d6eec
Align MACE CUDA extras
laserkelvin May 14, 2026
6fa8210
Compose MACE with CUDA extras
laserkelvin May 14, 2026
cf089a6
chore: excluding darwin on sys_platform
laserkelvin May 14, 2026
927e071
Pin CI sync to CUDA 13
laserkelvin May 14, 2026
f55f5d1
Clarify CUDA install index
laserkelvin May 14, 2026
f778516
docs: removing cu13 specification for io test
laserkelvin May 14, 2026
74c2fdd
docs: clarifying bind
laserkelvin May 14, 2026
273ee9f
Merge main dependency floor
laserkelvin May 14, 2026
0890dcb
chore: removing explicit torch pins
laserkelvin May 14, 2026
fdf2c90
docs: aligning cu specification in README
laserkelvin May 14, 2026
acd1e34
docs: catching remaining cu130 mentions
laserkelvin May 14, 2026
f3e719d
Merge origin/main into training-epic
laserkelvin May 17, 2026
9f79849
Merge training-epic into feat-training-runtime-primitives
laserkelvin May 18, 2026
15dcd2c
Merge pull request #4 from laserkelvin/feat-training-runtime-primitives
laserkelvin May 18, 2026
498e138
Merge branch 'training-epic' into feat-training-update-orchestrator
laserkelvin May 18, 2026
efaf180
Merge remote-tracking branch 'fork/training-epic' into feat-training-…
laserkelvin May 18, 2026
dda8374
Address training strategy review feedback
laserkelvin May 18, 2026
ea53486
Harden restored model specs
laserkelvin May 18, 2026
04f03b9
Preserve composed loss weights in specs
laserkelvin May 18, 2026
85b2f63
Preserve training model call mode in specs
laserkelvin May 18, 2026
ba81a7c
Support ModuleDict in optimizer setup
laserkelvin May 18, 2026
a091c31
Reject empty optimizer configs
laserkelvin May 18, 2026
6344265
Cover training strategy validation gaps
laserkelvin May 18, 2026
108ebcb
Restore strategy validation messages
laserkelvin May 18, 2026
368c856
Cache constructor serialization introspection
laserkelvin May 18, 2026
7d0f6b2
Avoid duplicate freeze parameter traversal
laserkelvin May 19, 2026
372ebbb
feat(training): add MixedPrecisionHook
laserkelvin May 11, 2026
8271a08
test(training): extract shared training test fixtures to conftest
laserkelvin May 12, 2026
2a5fe45
fix(training/hooks): skip post-backward update veto validation
laserkelvin May 13, 2026
45f2b5d
fix(training): align AMP unscale with optimizer steps
laserkelvin May 20, 2026
d149a42
docs(training): document mixed precision hooks
laserkelvin May 20, 2026
083dadd
fix(training): narrow AMP autocast scope
laserkelvin May 20, 2026
35ec6b2
refactor(training): dispatch mixed precision hook stages
laserkelvin May 20, 2026
0b80ce0
feat(training): add EMAHook core for exponential moving average
laserkelvin May 12, 2026
00a8fce
feat(training): add EMAHook state_dict and load_state_dict for checkp…
laserkelvin May 12, 2026
48930b4
test(training): add EMAHook unit tests
laserkelvin May 13, 2026
ca211ea
docs(training): document EMAHook checkpoint recipe
laserkelvin May 13, 2026
3362e01
Merge pull request #5 from laserkelvin/feat-training-strategy-orchest…
laserkelvin May 20, 2026
f13df53
Merge remote-tracking branch 'fork/training-epic' into feat-training-…
laserkelvin May 20, 2026
7817070
Merge branch 'main' into add-pnm-dependency
laserkelvin May 20, 2026
ec9bb80
docs: improving docstrings for training update hook
laserkelvin May 21, 2026
73b9995
fix(training): clear train context after batch failures
laserkelvin May 21, 2026
99c35be
fix(training): expose optimizer step skip state
laserkelvin May 21, 2026
5be961f
fix(training): preserve update hook insertion order
laserkelvin May 21, 2026
c08544b
refactor(training): dispatch update stages directly
laserkelvin May 21, 2026
d71d283
docs(training): clarify update stage context
laserkelvin May 21, 2026
083d03d
test(training): cover update stage ownership
laserkelvin May 21, 2026
96e0df9
refactor: removing skipping attributes from training context
laserkelvin May 22, 2026
c06d985
refactor(training): use match for update stage dispatch
laserkelvin May 22, 2026
0f85a21
feat(training): expose single-batch training flow
laserkelvin May 22, 2026
bc1bd00
docs(training): document update hook constraints
laserkelvin May 22, 2026
ef5831d
refactor(training): clarify optimizer lifecycle boundaries
laserkelvin May 22, 2026
3fb554f
fix(training): run to target step count
laserkelvin May 23, 2026
18f79cf
fix(training): resume dataloader epochs deterministically
laserkelvin May 23, 2026
294b905
fix(training): align step targets with optimizer updates
laserkelvin May 23, 2026
4cbbd33
feat(training): add MixedPrecisionHook
laserkelvin May 11, 2026
9316934
test(training): extract shared training test fixtures to conftest
laserkelvin May 12, 2026
cebaa1b
fix(training): align AMP unscale with optimizer steps
laserkelvin May 20, 2026
627b16a
docs(training): document mixed precision hooks
laserkelvin May 20, 2026
ee9c3e0
fix(training): narrow AMP autocast scope
laserkelvin May 20, 2026
75ea950
refactor(training): dispatch mixed precision hook stages
laserkelvin May 20, 2026
c63e3c1
test(training): align mixed precision tests with train batch helper
laserkelvin May 22, 2026
8b75aad
fix(training): prevent duplicate mixed precision hooks
laserkelvin May 22, 2026
2396c0c
test(training): align update hook API expectations
laserkelvin May 26, 2026
14580ac
Merge pull request #9 from laserkelvin/feat-training-update-orchestrator
laserkelvin May 27, 2026
b9aa80b
Merge remote-tracking branch 'fork/training-epic' into feat-mixed-pre…
laserkelvin May 27, 2026
ad7ba4c
test: consolidating and using existing device fixture
laserkelvin May 28, 2026
92eaa19
Merge pull request #7 from laserkelvin/feat-mixed-precision-hook
laserkelvin May 28, 2026
c85c44f
Merge branch 'feat-training-update-orchestrator' into feat-ema-hook
laserkelvin May 28, 2026
03b307b
Merge remote-tracking branch 'fork/training-epic' into feat-ema-hook
laserkelvin May 28, 2026
cdbe62e
Merge remote-tracking branch 'fork/training-epic' into feat-ema-hook
laserkelvin May 28, 2026
6b810c1
feat(training): add strategy checkpoint restart loading
laserkelvin May 28, 2026
e441123
fix(training): restore checkpoint restart consistency
laserkelvin May 28, 2026
690b5d3
docs(training): note checkpoint restart workflow
laserkelvin May 28, 2026
8acb8b2
Merge pull request #8 from laserkelvin/feat-ema-hook
laserkelvin May 29, 2026
9720c5b
fix(data): generate edge rows in io benchmark
laserkelvin May 29, 2026
af3095d
refactor(data): profile io benchmark readback
laserkelvin May 29, 2026
a424720
refactor(data): batch zarr dataloader reads
laserkelvin May 29, 2026
01bc4f3
feat(data): compare zarr readback modes
laserkelvin May 30, 2026
e1a23e8
docs(data): document zarr readback modes
laserkelvin May 30, 2026
849adf0
docs(data): refresh zarr benchmark examples
laserkelvin May 30, 2026
eb51e24
Merge remote-tracking branch 'origin/main' into training-epic
laserkelvin May 30, 2026
e774991
Merge remote-tracking branch 'fork/training-epic' into feat-checkpoin…
laserkelvin May 31, 2026
35f76ee
feat(training): add periodic checkpoint hook
laserkelvin Jun 2, 2026
def6893
fix(training): respect checkpoint hook lifecycle
laserkelvin Jun 2, 2026
2b64eac
fix(training): make checkpoint hook cadence explicit
laserkelvin Jun 2, 2026
03b5b8e
refactor: simplifying mutual exclusion
laserkelvin Jun 2, 2026
5263c85
test(training): cover checkpoint hook restart cycles
laserkelvin Jun 2, 2026
007f473
feat(training): add strategy checkpoint helpers
laserkelvin Jun 2, 2026
ff64226
docs: adding explicit note about hook state persistence
laserkelvin Jun 2, 2026
63429bb
feat(data): benchmark shuffled zarr readback
laserkelvin Jun 2, 2026
6bf3e79
refactor(training): rename loss classes and harmonize ignore_nonfinite
laserkelvin Jun 3, 2026
144e0e8
fix(training): align EnergyMAELoss per_atom reduction with atom-count…
laserkelvin Jun 3, 2026
a8a115a
refactor(training): extract template-method pattern from BaseLossFunc…
laserkelvin Jun 3, 2026
2dfb7a2
docs(training): document custom mask, reduce, and plum dispatch patterns
laserkelvin Jun 3, 2026
67ee763
docs(skills): add nvalchemi-loss-api agent skill
laserkelvin Jun 3, 2026
cd1e9d5
docs(training): document distributed checkpoint semantics
laserkelvin Jun 3, 2026
6159191
Merge pull request #12 from laserkelvin/feat-checkpoint-loading
laserkelvin Jun 3, 2026
36bee78
Merge remote-tracking branch 'origin/main' into training-epic
laserkelvin Jun 3, 2026
46ed09a
Merge remote-tracking branch 'fork/training-epic' into training-epic
laserkelvin Jun 3, 2026
782d3dd
Merge remote-tracking branch 'fork/training-epic' into feat-mae-l2-lo…
laserkelvin Jun 3, 2026
b80e5fb
fix(training): update merged test files with renamed loss classes
laserkelvin Jun 3, 2026
7b1b0a5
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 3, 2026
b2165c5
Merge pull request #6 from laserkelvin/feat-mae-l2-loss-terms
laserkelvin Jun 3, 2026
fe7124e
feat(data): add read-only subcommand to nvalchemi-io-test CLI
laserkelvin Jun 4, 2026
d11c304
feat(training): add distributed manager DDP support
laserkelvin Jun 4, 2026
99e3a70
docs(training): add distributed manager guide
laserkelvin Jun 4, 2026
b901f9e
fix(training): unwrap DDP models for checkpoints
laserkelvin Jun 4, 2026
8f82416
fix(training): avoid duplicating manager in train context
laserkelvin Jun 4, 2026
16c01a0
fix(training): keep dataloader on strategy workflow
laserkelvin Jun 4, 2026
85891dc
feat(training): generalize DDP sampler configuration
laserkelvin Jun 4, 2026
c92b754
refactor: adding batch method from raw dicts
laserkelvin Jun 4, 2026
88ac068
refactor: adding batch method from raw dicts
laserkelvin Jun 4, 2026
938c253
test: adding unit tests for mega prefetch
laserkelvin Jun 4, 2026
30be8e0
refactor: modifying dataset and dataloader to work with megaprefetching
laserkelvin Jun 4, 2026
891f88f
docs(training): add DDP MLP example
laserkelvin Jun 4, 2026
3d1e779
fix(training): initialize DDP example from env
laserkelvin Jun 4, 2026
351fb2e
fix(training): avoid env reads in DDP example
laserkelvin Jun 4, 2026
59cb7c3
refactor(data): review fixes, double-buffer prefetch, read amplificat…
laserkelvin Jun 4, 2026
4736df6
docs(training): improve DDP example pedagogy
laserkelvin Jun 4, 2026
aa3bee4
docs: adding documentation on zarr perf tuning
laserkelvin Jun 4, 2026
5bef612
docs: updating agent skills to include zarr perf tuning
laserkelvin Jun 4, 2026
487cded
feat(training): add evaluation runtime plumbing
laserkelvin May 29, 2026
95bda5f
feat(training): add evaluate hook
laserkelvin May 29, 2026
a4c4072
test(training): cover evaluate hook
laserkelvin May 29, 2026
00cf563
perf(data): propagate non-blocking batch transfers
laserkelvin May 29, 2026
b5613a7
fix(training): harden evaluate hook runtime
laserkelvin May 29, 2026
4d0a377
feat(training): add evaluation zarr sink
laserkelvin May 30, 2026
7a8ec87
feat(training): stream evaluation outputs to sinks
laserkelvin May 30, 2026
0327b9a
test(training): cover evaluation sink outputs
laserkelvin May 30, 2026
e80a1b7
fix(training): harden evaluation sink writes
laserkelvin Jun 4, 2026
82cbb2e
fix(training): integrate evaluation with distributed manager
laserkelvin Jun 4, 2026
fafc2ed
fix(training): simplify DDP sampler injection
laserkelvin Jun 5, 2026
9946cb2
fix(data): propagate field-level metadata through skip_validation path
laserkelvin Jun 5, 2026
f0cbd6b
test(data): add coverage for field_levels in from_raw_dicts and Zarr …
laserkelvin Jun 5, 2026
03d695c
Merge pull request #14 from laserkelvin/feat-distributed-manager
laserkelvin Jun 5, 2026
e39dad8
perf(data): optimize shuffled Zarr reads
laserkelvin Jun 5, 2026
78a1900
refactor(data): simplify fused prefetch loader API
laserkelvin Jun 5, 2026
a17edd8
refactor(data): clarify reader batch loading hooks
laserkelvin Jun 6, 2026
21c3d23
docs(data): explain reader batch loading pipeline
laserkelvin Jun 6, 2026
2e608fe
docs(data): refresh Zarr read tuning guide
laserkelvin Jun 6, 2026
563bd60
docs(data): update Zarr performance agent skill
laserkelvin Jun 6, 2026
1a0a8ff
docs(data): refresh datapipes API guide
laserkelvin Jun 6, 2026
f47ef03
Merge pull request #13 from laserkelvin/fix-io-edge-roundtrip-profiling
laserkelvin Jun 6, 2026
bac9baf
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 7, 2026
5e422ae
feat(data): compose PhysicsNeMo datapipes
laserkelvin Jun 7, 2026
fd1f1a9
refactor(data): route multidataset batching
laserkelvin Jun 7, 2026
7920f29
feat(training): add ComposedLossFunction.requires_eval_grad
laserkelvin Jun 8, 2026
58c241e
fix(data): tighten multidataset batching semantics
laserkelvin Jun 8, 2026
7dc4373
test(data): cover multidataset sampler policies
laserkelvin Jun 8, 2026
21cd8d9
feat(training): support metric-driven LR schedulers
laserkelvin Jun 8, 2026
dee73bb
feat(training): make validation first-class on TrainingStrategy
laserkelvin Jun 8, 2026
5b3d55a
refactor(data): clarify multidataset public APIs
laserkelvin Jun 8, 2026
9349155
refactor(training)!: remove EvaluateHook in favor of first-class vali…
laserkelvin Jun 8, 2026
a84b275
feat(data): add multidataset epoch policies
laserkelvin Jun 8, 2026
000968e
refactor(data): extract multidataset route plans
laserkelvin Jun 8, 2026
a0d406a
refactor(data): collapse batch loading API
laserkelvin Jun 8, 2026
2abc585
feat(hooks): add reporting orchestrator
laserkelvin Jun 3, 2026
d517386
feat(hooks): add scalar jsonl reporting
laserkelvin Jun 3, 2026
d2dacaf
feat(hooks): add tensorboard reporting
laserkelvin Jun 4, 2026
d506759
feat(hooks): add rich reporting dashboards
laserkelvin Jun 7, 2026
bb22fad
refactor(hooks): split rich reporting layouts
laserkelvin Jun 7, 2026
f9ce176
feat(hooks): improve rich dynamics layouts
laserkelvin Jun 7, 2026
cd459c4
docs(hooks): expand reporting guide
laserkelvin Jun 7, 2026
114373a
feat(hooks): complete rich reporting dashboards
laserkelvin Jun 7, 2026
3dfbb95
docs(data): document multidataset batch loading
laserkelvin Jun 8, 2026
d1f7161
refactor(data): narrow dataset batch API
laserkelvin Jun 8, 2026
66fa626
fix(hooks): harden reporting reductions
laserkelvin Jun 8, 2026
66e7dd5
docs(data): add multidataset changelog entry
laserkelvin Jun 8, 2026
79e5fcc
feat(training): add AFTER_VALIDATION stage and runtime optimizer record
laserkelvin Jun 8, 2026
c161bd0
test(training): add ValidationLoop standalone and distributed test su…
laserkelvin Jun 8, 2026
1ff1c03
docs(training): document first-class validation API
laserkelvin Jun 8, 2026
0dd9f08
Merge remote-tracking branch 'origin/main' into training-epic
laserkelvin Jun 8, 2026
cd7d699
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 8, 2026
3ee6770
refactor(hooks): drop jsonl reporting interface
laserkelvin Jun 8, 2026
0599c83
test(hooks): consolidate reporting test helpers
laserkelvin Jun 8, 2026
9b2deb8
docs(examples): add rich training reporting demo
laserkelvin Jun 8, 2026
79319e4
refactor(training): replace evaluation sinks with optional batch_call…
laserkelvin Jun 9, 2026
21287c9
docs(training): add validation guide with batch_callback escape hatch
laserkelvin Jun 9, 2026
411092a
test: removing unnecessary test
laserkelvin Jun 9, 2026
1762f4e
refactor(data): fold balanced multidataset sampler into constructor
laserkelvin Jun 9, 2026
2df1950
add unweighted component loss to allow monitoring during validation
ys-teh Jun 6, 2026
e198a20
remove weighted component loss
ys-teh Jun 9, 2026
4a188db
Merge pull request #19 from ys-teh/feature/unweighted-comp-loss-tracking
laserkelvin Jun 9, 2026
a375ecd
Merge remote-tracking branch 'fork/training-epic' into feat-evaluatio…
laserkelvin Jun 9, 2026
af497ee
Merge pull request #18 from laserkelvin/feat-evaluation-hook
laserkelvin Jun 10, 2026
53d060d
Update TrainingStage setup test expectations
laserkelvin Jun 10, 2026
2a2ee26
Add distributed multidataset sampler support
laserkelvin Jun 10, 2026
748373d
Document distributed datapipe sampler workflows
laserkelvin Jun 10, 2026
64fb277
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 10, 2026
08aee2a
Merge remote-tracking branch 'origin/main' into training-epic
laserkelvin Jun 10, 2026
6b68d88
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 10, 2026
7657072
feat(training): checkpoint hook runtime state
laserkelvin Jun 10, 2026
2d00c3d
fix(training): clear stale EMA checkpoint state
laserkelvin Jun 10, 2026
42ecac2
test(training): cover EMA checkpoint restarts
laserkelvin Jun 10, 2026
1dc81d5
docs(training): explain restartable hooks
laserkelvin Jun 10, 2026
a548731
add huber loss for energy, force, and stress terms
ys-teh Jun 10, 2026
f082173
Merge pull request #21 from ys-teh/feature/huber_loss
laserkelvin Jun 10, 2026
8ebe1b1
Merge fork/training-epic into fix-checkpointable-hooks
laserkelvin Jun 10, 2026
b715033
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 10, 2026
b77b202
Merge pull request #20 from laserkelvin/fix-checkpointable-hooks
laserkelvin Jun 10, 2026
ad59acb
Use unweighted validation component losses
ys-teh Jun 11, 2026
3b012c4
move inference_model to primary device in set_inference_model
ys-teh Jun 11, 2026
32f527f
docs: updating documentation with profile refator and implementation
laserkelvin Jun 11, 2026
69b5c67
Add shared profiling hooks
laserkelvin Jun 11, 2026
4026df2
Replace dynamics profiler hook with compatibility shim
laserkelvin Jun 11, 2026
0c74455
Merge pull request #22 from ys-teh/fix/validation-fixes-to-training-epic
laserkelvin Jun 11, 2026
e311a4c
Centralize distributed rank helpers
laserkelvin Jun 11, 2026
25625e0
fix dataloader custom field batching
laserkelvin Jun 11, 2026
c13037c
test validated custom dataloader fields
laserkelvin Jun 11, 2026
145c1b8
docs update dataloader changelog
laserkelvin Jun 11, 2026
b4523bc
Merge pull request #24 from laserkelvin/exp-training-dataloader-custo…
laserkelvin Jun 11, 2026
bf95db4
Fix EMA tensor device restoration
laserkelvin Jun 11, 2026
c9ee6cf
Add EMA device restoration coverage
laserkelvin Jun 11, 2026
0a6e675
Test MACE EMA checkpoint roundtrip
laserkelvin Jun 11, 2026
fcb39c4
Route torchvision through CUDA indexes
laserkelvin Jun 11, 2026
115a16f
Remove torchvision fake op patch from MACE tests
laserkelvin Jun 11, 2026
851d5a2
Use strategy checkpoint path for MACE EMA test
laserkelvin Jun 11, 2026
56b0c29
Support callable model specs for MACE checkpoints
laserkelvin Jun 12, 2026
d9043f2
Clarify MACE EMA strategy checkpoint test
laserkelvin Jun 12, 2026
d5f0946
Document EMA checkpoint reconstruction fix
laserkelvin Jun 12, 2026
cd38cb0
Merge remote-tracking branch 'fork/training-epic' into cueq-mace-fix
laserkelvin Jun 12, 2026
c9fa621
Publish restored EMA before validation
laserkelvin Jun 12, 2026
5dd9d76
Initialize EMA during training setup
laserkelvin Jun 12, 2026
0f51499
match tensor dtype in ema
ys-teh Jun 12, 2026
210a4a7
Clarify EMA stage dispatch
laserkelvin Jun 12, 2026
7454cb4
Merge pull request #26 from ys-teh/fix/ema_dype
laserkelvin Jun 12, 2026
14ce0cc
Merge pull request #25 from laserkelvin/cueq-mace-fix
laserkelvin Jun 12, 2026
d077e7f
update pipeline to be compatible with ema
ys-teh Jun 12, 2026
c9349ab
Merge remote-tracking branch 'fork/training-epic' into multi-dataset-…
laserkelvin Jun 12, 2026
1a50571
Merge pull request #17 from laserkelvin/multi-dataset-support
laserkelvin Jun 12, 2026
b9254ba
Merge pull request #23 from laserkelvin/feat-physicsnemo-profiler-hook
laserkelvin Jun 12, 2026
b9855ee
Merge fork/training-epic into feat-reporting-abstraction
laserkelvin Jun 12, 2026
e032012
fix(hooks): align reporting loss component keys
laserkelvin Jun 12, 2026
c0b8907
Merge pull request #16 from laserkelvin/feat-reporting-abstraction
laserkelvin Jun 12, 2026
0e3b818
Merge pull request #28 from ys-teh/fix/pipeline-compatiblity-w-ema
laserkelvin Jun 12, 2026
6a9a7da
fix bug on unweighted loss reporting
ys-teh Jun 13, 2026
17cd3b5
add fix
EricZQu Jun 13, 2026
e677a7e
Add ema build override site
EricZQu Jun 15, 2026
3d61b75
Update change log
EricZQu Jun 15, 2026
950614b
Merge pull request #29 from ys-teh/fix/unweighted-loss-reporting
laserkelvin Jun 15, 2026
987f778
Merge pull request #30 from EricZQu/ngnp-integration-feedback
laserkelvin Jun 15, 2026
3 changes: 2 additions & 1 deletion .claude/skills/README.md
@@ -6,7 +6,8 @@ concise instructions on how to use the `nvalchemi` API for elementary
tasks.

- `nvalchemi-data-structures`: how to use individual atomic systems as well as batches.
- `nvalchemi-data-storage`: how to write and read atomic data.
- `nvalchemi-data-storage`: how to write, read, compose, and load atomic data.
- `nvalchemi-zarr-perf`: how to tune Zarr-backed Dataset/DataLoader throughput.
- `nvalchemi-model-wrapping`: how to wrap MLIPs to use arbitrary models within `nvalchemi`.
- `nvalchemi-dynamics-implementation`: how to implement a simple dynamics class.
- `nvalchemi-dynamics-hooks`: how to implement and use `Hook`s in dynamics.
78 changes: 68 additions & 10 deletions .claude/skills/nvalchemi-data-storage/SKILL.md
@@ -1,6 +1,9 @@
---
name: nvalchemi-data-storage
description: How to write, read, and load atomic data using nvalchemi's composable Zarr-backed storage pipeline (Writer, Reader, Dataset, DataLoader).
description: >-
How to write, read, compose, and load atomic data using nvalchemi's
composable Zarr-backed storage pipeline (Writer, Reader, Dataset,
MultiDataset, DataLoader).
---

# nvalchemi Data Storage
@@ -9,31 +12,36 @@ description: How to write, read, and load atomic data using nvalchemi's composab

`nvalchemi` provides a composable pipeline for persisting and loading atomic data:

```
```text
Writer Reader
(AtomicData/Batch -> Zarr) (Zarr -> dict[str, Tensor])
|
Dataset
(dict -> AtomicData, device transfer, prefetch)
(dict -> AtomicData, load_batches, prefetch)
|
optional MultiDataset composition
|
DataLoader
(AtomicData -> Batch, batching, iteration)
(Batch iteration)
```

```python
from nvalchemi.data.datapipes import (
AtomicDataZarrWriter,
AtomicDataZarrReader,
Dataset,
MultiDataset,
DataLoader,
MultiDatasetBatchSampler,
)
```

---

## Writing Data

`AtomicDataZarrWriter` serializes `AtomicData`, `list[AtomicData]`, or `Batch` into a Zarr store.
`AtomicDataZarrWriter` serializes `AtomicData`, `list[AtomicData]`, or
`Batch` into a Zarr store.

```python
from nvalchemi.data import AtomicData, Batch
@@ -82,7 +90,7 @@ writer.defragment() # rebuild store without deleted samples

### Zarr store layout

```
```text
dataset.zarr/
├── meta/
│ ├── atoms_ptr # int64 [N+1] — cumulative node counts
@@ -144,6 +152,10 @@ atomic_data, metadata = ds[0] # AtomicData on target device
# Lightweight metadata (no full construction)
num_atoms, num_edges = ds.get_metadata(0)

# Explicit batch loading. This is the canonical synchronous batch API.
batches = ds.load_batches([[0, 3, 2], [4, 1, 5]])
batch0 = batches[0]

len(ds) # number of samples
ds.close()

@@ -178,20 +190,23 @@ Iterates over a `Dataset` in batches, producing `Batch` objects.
```python
from nvalchemi.data.datapipes import AtomicDataZarrReader, Dataset, DataLoader

reader = AtomicDataZarrReader("dataset.zarr")
ds = Dataset(reader, device="cuda", num_workers=4)
reader = AtomicDataZarrReader("dataset.zarr", pin_memory=True)
ds = Dataset(reader, device="cuda", num_workers=1)

loader = DataLoader(
ds,
batch_size=32,
shuffle=True,
drop_last=False,
sampler=None, # optional torch Sampler
prefetch_factor=2, # batches to prefetch ahead
num_streams=4, # CUDA streams for prefetching
prefetch_factor=16, # fuse 16 batches per read_many call
num_streams=2, # CUDA streams for prefetching
use_streams=True, # enable stream prefetching
)

# For throughput tuning (skip_validation, prefetch_factor, chunk/shard
# sizing), load the nvalchemi-zarr-perf agent skill.

for batch in loader:
# batch is a Batch with concatenated tensors on target device
print(batch.num_graphs, batch.num_nodes)
@@ -200,6 +215,45 @@ len(loader) # number of batches
loader.set_epoch(epoch) # for distributed sampler
```

Use `prefetch_factor=0` to disable async fused prefetch while still reading each
emitted batch through `Dataset.load_batches([indices])`. For explicit/manual
batch reads, use `load_batches(...)`.
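
To make the fused-read idea concrete, here is a minimal, library-independent sketch: the indices of several batches are flattened into one coalesced read, and the result is split back per batch. `fused_load` and `toy_read_many` are hypothetical names used only for illustration; this is an assumption about the general technique, not a description of how nvalchemi implements `load_batches` or `prefetch_factor`.

```python
from itertools import chain


def fused_load(read_many, batches):
    """Fetch every index of several batches with one read, then regroup."""
    flat = list(chain.from_iterable(batches))
    samples = read_many(flat)  # single coalesced read for all batches
    grouped, start = [], 0
    for indices in batches:
        grouped.append(samples[start:start + len(indices)])
        start += len(indices)
    return grouped


calls = []


def toy_read_many(indices):
    """Hypothetical reader stand-in that records each call it receives."""
    calls.append(list(indices))
    return [f"sample-{i}" for i in indices]


batches = fused_load(toy_read_many, [[0, 3, 2], [4, 1, 5]])
print(batches[0])  # samples for the first batch, in request order
print(len(calls))  # number of read calls issued
```

In this toy, two batches cost one read call instead of two, which is the kind of saving a larger `prefetch_factor` is meant to buy.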

### Composing multiple datasets

Use `MultiDataset` to concatenate multiple `Dataset` instances behind one global
index space while keeping the same `load_batches(...)` fast path:

```python
from nvalchemi.data.datapipes import (
AtomicDataZarrReader,
DataLoader,
Dataset,
MultiDataset,
MultiDatasetBatchSampler,
)

ds_a = Dataset(AtomicDataZarrReader("dataset_a.zarr"), device="cuda")
ds_b = Dataset(AtomicDataZarrReader("dataset_b.zarr"), device="cuda")
dataset = MultiDataset(ds_a, ds_b, output_strict=True)

batch_sampler = MultiDatasetBatchSampler.balanced(
dataset,
batch_size=64,
epoch_policy="max_size", # oversample smaller datasets when replacement=True
replacement=True,
)

loader = DataLoader(dataset, batch_sampler=batch_sampler, prefetch_factor=16)
```
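
The "one global index space" idea can be sketched without the library. The helper below maps a global index to a `(dataset_id, local_index)` pair using cumulative dataset sizes. `locate` and the toy sizes are hypothetical; whether `MultiDataset` resolves indices this way internally is an assumption.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, sizes):
    """Map a global index to (dataset_id, local_index)."""
    ends = list(accumulate(sizes))  # exclusive end offset of each dataset
    if not ends or not 0 <= global_index < ends[-1]:
        raise IndexError(f"index {global_index} out of range")
    dataset_id = bisect_right(ends, global_index)
    start = ends[dataset_id - 1] if dataset_id > 0 else 0
    return dataset_id, global_index - start


sizes = [3, 5]  # toy sizes standing in for len(ds_a), len(ds_b)
print([locate(i, sizes) for i in (0, 2, 3, 7)])
```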

Sampler notes:

- `samples_per_dataset` accepts integer counts or float ratios.
- `epoch_policy="min_size"` stops at the smallest contributing dataset.
- `epoch_policy="max_size"` covers the largest dataset and oversamples smaller
datasets when `replacement=True`.
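
As a rough illustration of how the two epoch policies could differ in epoch length, the sketch below counts batches under one plausible reading: `min_size` stops once any dataset can no longer fill its share of a batch, and `max_size` keeps going until the dataset needing the most batches is covered. `epoch_length`, its arguments, and the rounding rules are assumptions for illustration, not the sampler's documented behavior.

```python
import math


def epoch_length(dataset_sizes, per_batch_counts, epoch_policy):
    """Batches per epoch under a toy reading of the two policies."""
    pairs = list(zip(dataset_sizes, per_batch_counts))
    if epoch_policy == "min_size":
        # Stop when the first dataset runs out of full per-batch shares.
        return min(size // count for size, count in pairs)
    if epoch_policy == "max_size":
        # Continue until the slowest dataset is fully covered.
        return max(math.ceil(size / count) for size, count in pairs)
    raise ValueError(f"unknown epoch_policy: {epoch_policy!r}")


# Two toy datasets, each contributing 32 samples to a batch of 64.
print(epoch_length([100, 1000], [32, 32], "min_size"))
print(epoch_length([100, 1000], [32, 32], "max_size"))
```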

---

## Custom Readers
@@ -218,6 +272,10 @@ class MyReader(Reader):
"""Load raw tensor dict for a single sample."""
...

def _load_many_samples(self, indices) -> list[dict[str, torch.Tensor]]:
"""Optional fast path for coalesced batch reads."""
...

def __len__(self) -> int:
"""Total number of samples."""
...
24 changes: 17 additions & 7 deletions .claude/skills/nvalchemi-data-structures/SKILL.md
@@ -1,6 +1,8 @@
---
name: nvalchemi-data-structures
description: How to use AtomicData and Batch — the core graph-based data structures for representing atomic systems and batching them for GPU computation.
description: >-
How to use AtomicData and Batch, the core graph-based data structures for
representing atomic systems and batching them for GPU computation.
---

# nvalchemi Data Structures
@@ -10,7 +12,8 @@ description: How to use AtomicData and Batch — the core graph-based data struc
`nvalchemi` represents atomic systems as graphs using two core classes:

- **`AtomicData`** — a single atomic system (molecule, crystal, etc.)
- **`Batch`** — an efficient container of multiple `AtomicData` objects stored as concatenated tensors
- **`Batch`** — an efficient container of multiple `AtomicData` objects
stored as concatenated tensors

Both are Pydantic `BaseModel` subclasses with `DataMixin` for device/dtype operations.

@@ -274,7 +277,10 @@ batch.model_dump_json() # JSON string

### Distributed communication

`Batch` supports point-to-point distributed communication via `torch.distributed`. Data is sent in three phases: a metadata header (`num_graphs`, `num_nodes`, `num_edges`), per-group segment lengths, and bulk tensor data.
`Batch` supports point-to-point distributed communication via
`torch.distributed`. Data is sent in three phases: a metadata header
(`num_graphs`, `num_nodes`, `num_edges`), per-group segment lengths,
and bulk tensor data.
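
The three-phase ordering can be pictured with a small planning function that lists which messages would be sent for a batch. This is a sketch under stated assumptions: `plan_messages`, the payload shapes, and the group names are hypothetical, and a 0-graph batch is treated as header-only.

```python
def plan_messages(num_graphs, num_nodes, num_edges, segment_lengths):
    """Return the ordered (phase, payload) messages for one toy batch."""
    messages = [("header", (num_graphs, num_nodes, num_edges))]
    if num_graphs == 0:
        return messages  # 0-graph batch: only the header goes out
    messages.append(("segment_lengths", dict(segment_lengths)))
    # Stand-in for the bulk tensors: just name the groups that carry data.
    messages.append(("data", sorted(segment_lengths)))
    return messages


full = plan_messages(2, 5, 12, {"atoms": 5, "edges": 12, "system": 2})
empty = plan_messages(0, 0, 0, {})
print([phase for phase, _ in full])
print([phase for phase, _ in empty])
```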

**Blocking send/recv:**

@@ -304,10 +310,14 @@ received = handle.wait() # block until data arrives, returns Batch

**Key details:**

- `template` is required on the receiver to know the attribute keys, dtypes, and group structure (atoms/edges/system). Cache it across calls.
- A 0-graph (sentinel) batch can be sent/received — only the metadata header is transmitted.
- `tag` is a base tag; it is incremented internally per group. Use distinct base tags for concurrent send/recv pairs.
- `empty_like(batch)` creates a 0-graph batch with the same schema — useful for sentinel signals.
- `template` is required on the receiver to know the attribute keys,
dtypes, and group structure (atoms/edges/system). Cache it across calls.
- A 0-graph sentinel batch can be sent or received. Only the metadata
header is transmitted.
- `tag` is a base tag incremented internally per group. Use distinct
base tags for concurrent send/recv pairs.
- `empty_like(batch)` creates a 0-graph batch with the same schema, which
is useful for sentinel signals.

```python
sentinel = Batch.empty_like(batch, device="cuda") # 0-graph, same schema
51 changes: 35 additions & 16 deletions .claude/skills/nvalchemi-dynamics-api/SKILL.md
@@ -215,11 +215,17 @@ stage = DemoDynamics(
)
```

The default `comm_mode` is `"async_recv"`. The three modes differ in when blocking occurs:

- `"sync"`: `irecv` completes inline in `_prestep_sync_buffers` — simplest, good for debugging
- `"async_recv"`: `irecv` is posted in `_prestep_sync_buffers` but `wait()` deferred to `_complete_pending_recv` — allows compute/communication overlap
- `"fully_async"`: both send and receive are deferred — maximum overlap, highest throughput; pending sends from the previous step are drained at the start of the next `_prestep_sync_buffers`
The default `comm_mode` is `"async_recv"`. The three modes differ in when
blocking occurs:

- `"sync"`: `irecv` completes inline in `_prestep_sync_buffers`; simplest
and good for debugging.
- `"async_recv"`: `irecv` is posted in `_prestep_sync_buffers`, but
`wait()` is deferred to `_complete_pending_recv` for communication
overlap.
- `"fully_async"`: send and receive are both deferred for maximum
overlap. Pending sends from the prior step are drained at the start of
the next `_prestep_sync_buffers`.

### Pre-allocated buffers

@@ -240,9 +246,12 @@ stage = DemoDynamics(
)
```

Buffers are **lazily initialized** on the first step using the first concrete batch as a template for attribute keys, dtypes, and shapes. This means the first step has slightly more overhead.
Buffers are **lazily initialized** on the first step using the first
concrete batch as a template for attribute keys, dtypes, and shapes.
This means the first step has slightly more overhead.

Adjacent stages must use identical `BufferConfig` values — this is validated in `DistributedPipeline.setup()`.
Adjacent stages must use identical `BufferConfig` values. This is
validated in `DistributedPipeline.setup()`.

---

@@ -262,20 +271,27 @@ The dynamics framework manages data flow through three layers:

Each pipeline step follows a four-phase protocol:

1. `_prestep_sync_buffers()` — zeros send buffer, posts `irecv` from prior rank
2. `_complete_pending_recv()` — waits on deferred recv, routes into active batch, drains overflow sinks
3. `step()` — dynamics integration
4. `_poststep_sync_buffers(converged_indices)` — extracts converged into send buffer, sends to next rank
1. `_prestep_sync_buffers()` zeros the send buffer and posts `irecv`
from the prior rank.
2. `_complete_pending_recv()` waits on deferred receive, routes into
the active batch, and drains overflow sinks.
3. `step()` runs dynamics integration.
4. `_poststep_sync_buffers(converged_indices)` extracts converged
samples into the send buffer and sends them to the next rank.

**Deadlock prevention:** when no samples converge, an empty send buffer is still sent so the downstream `irecv` completes.
**Deadlock prevention:** when no samples converge, an empty send buffer
is still sent so the downstream `irecv` completes.
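
The deadlock-prevention rule can be seen in a single-process toy: the upstream side sends a message every step, even an empty one, so the downstream receive always completes. The sketch uses a `queue.Queue` as a stand-in for the rank-to-rank channel. `run_pipeline` and the "steps left until converged" counters are hypothetical and do not reflect the real stage classes.

```python
from queue import Queue


def run_pipeline(steps_left, num_steps):
    """Simulate upstream sends and downstream receives for a few steps."""
    link = Queue()  # stand-in for the rank-to-rank channel
    upstream, downstream, sizes = list(steps_left), [], []
    for _ in range(num_steps):
        upstream = [n - 1 for n in upstream]  # stand-in for step()
        done = [n for n in upstream if n <= 0]
        upstream = [n for n in upstream if n > 0]
        link.put(done)  # always send, even when `done` is empty
        message = link.get(timeout=1.0)  # receive completes every step
        sizes.append(len(message))
        downstream.extend(message)
    return sizes, len(downstream)


sizes, total = run_pipeline([3, 1, 3], num_steps=3)
print(sizes)  # message size per step; the empty send shows up as 0
print(total)  # samples that reached the downstream side
```

If the empty send were skipped, the `get` in the second step would have nothing to receive and would time out, which is the toy analogue of a stalled `irecv`.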

### Back-pressure

When `send_buffer` has limited capacity (via `BufferConfig`):

- Only `min(converged_count, remaining_capacity)` samples are extracted
- Excess converged samples remain in the active batch as **no-ops** — their positions/velocities are saved before the integrator and restored after
- Without `BufferConfig`, all converged samples are sent without constraints (backward compat)
- Excess converged samples remain in the active batch as **no-ops**.
Their positions and velocities are saved before the integrator and
restored after it runs.
- Without `BufferConfig`, all converged samples are sent without
constraints (backward compatible).
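
The capacity rule above reduces to a small piece of arithmetic. The sketch below splits converged indices into those that fit in the send buffer and those that stay behind as no-ops. `split_for_send`, `capacity`, and `used` are hypothetical names; `capacity=None` models the case without `BufferConfig`.

```python
def split_for_send(converged_indices, capacity=None, used=0):
    """Return (to_send, held_back) for one step."""
    if capacity is None:
        # No buffer limit configured: everything converged is sent.
        return list(converged_indices), []
    remaining = max(capacity - used, 0)
    n_send = min(len(converged_indices), remaining)
    return list(converged_indices[:n_send]), list(converged_indices[n_send:])


print(split_for_send([4, 7, 9], capacity=4, used=2))
print(split_for_send([4, 7, 9]))
```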

### Buffer lifecycle: put/defrag/zero

@@ -294,7 +310,9 @@ src_batch.defrag()
buffer.zero()
```

**Important:** `Batch.put()` uses Warp GPU kernels that only handle float32 attributes. Adjacent pipeline stages must have identical `BufferConfig` values.
**Important:** `Batch.put()` uses Warp GPU kernels that only handle
float32 attributes. Adjacent pipeline stages must have identical
`BufferConfig` values.

### Data routing methods

@@ -348,7 +366,8 @@ When `refill_frequency` triggers (every N steps), `_refill_check()`:
5. Appends replacements via `Batch.append`
6. Rebuilds `status` (replacements get `0`) and `fmax` (replacements get `inf`) tensors

This produces a **new** `Batch` object (not in-place mutation). Returns `None` when the sampler is exhausted and no active samples remain.
This produces a **new** `Batch` object, not an in-place mutation. It
returns `None` when the sampler is exhausted and no active samples remain.

### With FusedStage

2 changes: 1 addition & 1 deletion .claude/skills/nvalchemi-dynamics-implementation/SKILL.md
@@ -22,7 +22,7 @@ from nvalchemi.data import Batch

Each call to `step(batch)` executes:

```
```text
1. BEFORE_STEP hooks
2. BEFORE_PRE_UPDATE hooks → pre_update(batch) → AFTER_PRE_UPDATE hooks
3. BEFORE_COMPUTE hooks → compute(batch) → AFTER_COMPUTE hooks