NVIDIA · laserkelvin · May 13, 2026
diff --git a/.claude/skills/README.md b/.claude/skills/README.md
@@ -6,7 +6,8 @@ concise instructions on how to use the `nvalchemi` API for elementary
 tasks.
 
 - `nvalchemi-data-structures`: how to use individual atomic systems as well as batches.
-- `nvalchemi-data-storage`: how to write and read atomic data.
+- `nvalchemi-data-storage`: how to write, read, compose, and load atomic data.
+- `nvalchemi-zarr-perf`: how to tune Zarr-backed Dataset/DataLoader throughput.
 - `nvalchemi-model-wrapping`: how to wrap MLIPs to use arbitrary models within `nvalchemi`.
 - `nvalchemi-dynamics-implementation`: how to implement a simple dynamics class.
 - `nvalchemi-dynamics-hooks`: how to implement and use `Hook`s in dynamics.

diff --git a/.claude/skills/nvalchemi-data-storage/SKILL.md b/.claude/skills/nvalchemi-data-storage/SKILL.md
@@ -1,6 +1,9 @@
 ---
 name: nvalchemi-data-storage
-description: How to write, read, and load atomic data using nvalchemi's composable Zarr-backed storage pipeline (Writer, Reader, Dataset, DataLoader).
+description: >-
+  How to write, read, compose, and load atomic data using nvalchemi's
+  composable Zarr-backed storage pipeline (Writer, Reader, Dataset,
+  MultiDataset, DataLoader).
 ---
 
 # nvalchemi Data Storage
@@ -9,31 +12,36 @@ description: How to write, read, and load atomic data using nvalchemi's composab
 
 `nvalchemi` provides a composable pipeline for persisting and loading atomic data:
 
-```
+```text
 Writer                          Reader
 (AtomicData/Batch -> Zarr)      (Zarr -> dict[str, Tensor])
                                     |
                                 Dataset
-                                (dict -> AtomicData, device transfer, prefetch)
+                                (dict -> AtomicData, load_batches, prefetch)
+                                    |
+                    optional MultiDataset composition
                                     |
                                 DataLoader
-                                (AtomicData -> Batch, batching, iteration)
+                                (Batch iteration)
 ```
 
 ```python
 from nvalchemi.data.datapipes import (
     AtomicDataZarrWriter,
     AtomicDataZarrReader,
     Dataset,
+    MultiDataset,
     DataLoader,
+    MultiDatasetBatchSampler,
 )
 ```
 
 ---
 
 ## Writing Data
 
-`AtomicDataZarrWriter` serializes `AtomicData`, `list[AtomicData]`, or `Batch` into a Zarr store.
+`AtomicDataZarrWriter` serializes `AtomicData`, `list[AtomicData]`, or
+`Batch` into a Zarr store.
 
 ```python
 from nvalchemi.data import AtomicData, Batch
@@ -82,7 +90,7 @@ writer.defragment()      # rebuild store without deleted samples
 
 ### Zarr store layout
 
-```
+```text
 dataset.zarr/
 ├── meta/
 │   ├── atoms_ptr       # int64 [N+1] — cumulative node counts
@@ -144,6 +152,10 @@ atomic_data, metadata = ds[0]   # AtomicData on target device
 # Lightweight metadata (no full construction)
 num_atoms, num_edges = ds.get_metadata(0)
 
+# Explicit batch loading. This is the canonical synchronous batch API.
+batches = ds.load_batches([[0, 3, 2], [4, 1, 5]])
+batch0 = batches[0]
+
 len(ds)    # number of samples
 ds.close()
 
@@ -178,20 +190,23 @@ Iterates over a `Dataset` in batches, producing `Batch` objects.
 ```python
 from nvalchemi.data.datapipes import AtomicDataZarrReader, Dataset, DataLoader
 
-reader = AtomicDataZarrReader("dataset.zarr")
-ds = Dataset(reader, device="cuda", num_workers=4)
+reader = AtomicDataZarrReader("dataset.zarr", pin_memory=True)
+ds = Dataset(reader, device="cuda", num_workers=1)
 
 loader = DataLoader(
     ds,
     batch_size=32,
     shuffle=True,
     drop_last=False,
     sampler=None,              # optional torch Sampler
-    prefetch_factor=2,         # batches to prefetch ahead
-    num_streams=4,             # CUDA streams for prefetching
+    prefetch_factor=16,        # fuse 16 batches per read_many call
+    num_streams=2,             # CUDA streams for prefetching
     use_streams=True,          # enable stream prefetching
 )
 
+# For throughput tuning (skip_validation, prefetch_factor, chunk/shard
+# sizing), load the nvalchemi-zarr-perf agent skill.
+
 for batch in loader:
     # batch is a Batch with concatenated tensors on target device
     print(batch.num_graphs, batch.num_nodes)
@@ -200,6 +215,45 @@ len(loader)                    # number of batches
 loader.set_epoch(epoch)        # for distributed sampler
 ```
 
+Use `prefetch_factor=0` to disable async fused prefetch while still reading each
+emitted batch through `Dataset.load_batches([indices])`. For explicit/manual
+batch reads, use `load_batches(...)`.
+
+### Composing multiple datasets
+
+Use `MultiDataset` to concatenate multiple `Dataset` instances behind one global
+index space while keeping the same `load_batches(...)` fast path:
+
+```python
+from nvalchemi.data.datapipes import (
+    AtomicDataZarrReader,
+    DataLoader,
+    Dataset,
+    MultiDataset,
+    MultiDatasetBatchSampler,
+)
+
+ds_a = Dataset(AtomicDataZarrReader("dataset_a.zarr"), device="cuda")
+ds_b = Dataset(AtomicDataZarrReader("dataset_b.zarr"), device="cuda")
+dataset = MultiDataset(ds_a, ds_b, output_strict=True)
+
+batch_sampler = MultiDatasetBatchSampler.balanced(
+    dataset,
+    batch_size=64,
+    epoch_policy="max_size",  # oversample smaller datasets when replacement=True
+    replacement=True,
+)
+
+loader = DataLoader(dataset, batch_sampler=batch_sampler, prefetch_factor=16)
+```
+
+Sampler notes:
+
+- `samples_per_dataset` accepts integer counts or float ratios.
+- `epoch_policy="min_size"` stops at the smallest contributing dataset.
+- `epoch_policy="max_size"` covers the largest dataset and oversamples smaller
+  datasets when `replacement=True`.
+
 ---
 
 ## Custom Readers
@@ -218,6 +272,10 @@ class MyReader(Reader):
         """Load raw tensor dict for a single sample."""
         ...
 
+    def _load_many_samples(self, indices) -> list[dict[str, torch.Tensor]]:
+        """Optional fast path for coalesced batch reads."""
+        ...
+
     def __len__(self) -> int:
         """Total number of samples."""
         ...

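Note on the `MultiDataset` hunk above: the "one global index space" idea can be sketched in plain Python. This is a minimal illustration, not nvalchemi's implementation. The helper names `build_offsets`, `locate`, and `group_by_dataset` and the dataset sizes are hypothetical; the sketch only assumes that datasets are concatenated in the order they are passed.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(sizes):
    """Return cumulative end offsets, e.g. [3, 5] for sizes [3, 2]."""
    return list(accumulate(sizes))


def locate(global_index, offsets):
    """Map a global index to (dataset_id, local_index)."""
    if not 0 <= global_index < offsets[-1]:
        raise IndexError(global_index)
    # The first offset strictly greater than global_index identifies the dataset.
    dataset_id = bisect_right(offsets, global_index)
    start = offsets[dataset_id - 1] if dataset_id else 0
    return dataset_id, global_index - start


def group_by_dataset(indices, offsets):
    """Group one batch's global indices into per-dataset local index lists."""
    groups = {}
    for global_index in indices:
        dataset_id, local_index = locate(global_index, offsets)
        groups.setdefault(dataset_id, []).append(local_index)
    return groups


# Hypothetical sizes: the first dataset holds 3 samples, the second holds 2.
offsets = build_offsets([3, 2])
print(locate(0, offsets))                    # (0, 0)
print(locate(3, offsets))                    # (1, 0)
print(group_by_dataset([4, 0, 3], offsets))  # {1: [1, 0], 0: [0]}
```

Grouping a batch's indices per dataset is what would let each underlying dataset serve its share with a single coalesced read, which is presumably why the composition keeps the `load_batches(...)` fast path.
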
diff --git a/.claude/skills/nvalchemi-data-structures/SKILL.md b/.claude/skills/nvalchemi-data-structures/SKILL.md
@@ -1,6 +1,8 @@
 ---
 name: nvalchemi-data-structures
-description: How to use AtomicData and Batch — the core graph-based data structures for representing atomic systems and batching them for GPU computation.
+description: >-
+  How to use AtomicData and Batch, the core graph-based data structures for
+  representing atomic systems and batching them for GPU computation.
 ---
 
 # nvalchemi Data Structures
@@ -10,7 +12,8 @@ description: How to use AtomicData and Batch — the core graph-based data struc
 `nvalchemi` represents atomic systems as graphs using two core classes:
 
 - **`AtomicData`** — a single atomic system (molecule, crystal, etc.)
-- **`Batch`** — an efficient container of multiple `AtomicData` objects stored as concatenated tensors
+- **`Batch`** — an efficient container of multiple `AtomicData` objects
+  stored as concatenated tensors
 
 Both are Pydantic `BaseModel` subclasses with `DataMixin` for device/dtype operations.
 
@@ -274,7 +277,10 @@ batch.model_dump_json()               # JSON string
 
 ### Distributed communication
 
-`Batch` supports point-to-point distributed communication via `torch.distributed`. Data is sent in three phases: a metadata header (`num_graphs`, `num_nodes`, `num_edges`), per-group segment lengths, and bulk tensor data.
+`Batch` supports point-to-point distributed communication via
+`torch.distributed`. Data is sent in three phases: a metadata header
+(`num_graphs`, `num_nodes`, `num_edges`), per-group segment lengths,
+and bulk tensor data.
 
 **Blocking send/recv:**
 
@@ -304,10 +310,14 @@ received = handle.wait()  # block until data arrives, returns Batch
 
 **Key details:**
 
-- `template` is required on the receiver to know the attribute keys, dtypes, and group structure (atoms/edges/system). Cache it across calls.
-- A 0-graph (sentinel) batch can be sent/received — only the metadata header is transmitted.
-- `tag` is a base tag; it is incremented internally per group. Use distinct base tags for concurrent send/recv pairs.
-- `empty_like(batch)` creates a 0-graph batch with the same schema — useful for sentinel signals.
+- `template` is required on the receiver to know the attribute keys,
+  dtypes, and group structure (atoms/edges/system). Cache it across calls.
+- A 0-graph sentinel batch can be sent or received. Only the metadata
+  header is transmitted.
+- `tag` is a base tag that is incremented internally per group. Use
+  distinct base tags for concurrent send/recv pairs.
+- `empty_like(batch)` creates a 0-graph batch with the same schema, which
+  is useful for sentinel signals.
 
 ```python
 sentinel = Batch.empty_like(batch, device="cuda")  # 0-graph, same schema

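Note on the `tag` bullet in the hunk above: the text says only that the base tag is incremented per group. The sketch below assumes a specific scheme (group `i` uses `base_tag + i`, with the groups being atoms/edges/system), which is an assumption and not confirmed by the source. The helpers `group_tags` and `tags_collide` are hypothetical. Under that assumption, two concurrent send/recv pairs need base tags that differ by at least the number of groups, not merely by one.

```python
GROUPS = ("atoms", "edges", "system")  # group names taken from the skill text


def group_tags(base_tag, groups=GROUPS):
    """Assumed scheme: the i-th group uses tag base_tag + i."""
    return {name: base_tag + offset for offset, name in enumerate(groups)}


def tags_collide(base_a, base_b, groups=GROUPS):
    """True if two concurrent transfers would reuse any per-group tag."""
    tags_a = set(group_tags(base_a, groups).values())
    tags_b = set(group_tags(base_b, groups).values())
    return bool(tags_a & tags_b)


print(group_tags(10))       # {'atoms': 10, 'edges': 11, 'system': 12}
print(tags_collide(0, 1))   # True: tags {0, 1, 2} overlap {1, 2, 3}
print(tags_collide(0, 10))  # False: ranges are disjoint
```
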
diff --git a/.claude/skills/nvalchemi-dynamics-api/SKILL.md b/.claude/skills/nvalchemi-dynamics-api/SKILL.md
@@ -215,11 +215,17 @@ stage = DemoDynamics(
 )
 ```
 
-The default `comm_mode` is `"async_recv"`. The three modes differ in when blocking occurs:
-
-- `"sync"`: `irecv` completes inline in `_prestep_sync_buffers` — simplest, good for debugging
-- `"async_recv"`: `irecv` is posted in `_prestep_sync_buffers` but `wait()` deferred to `_complete_pending_recv` — allows compute/communication overlap
-- `"fully_async"`: both send and receive are deferred — maximum overlap, highest throughput; pending sends from the previous step are drained at the start of the next `_prestep_sync_buffers`
+The default `comm_mode` is `"async_recv"`. The three modes differ in when
+blocking occurs:
+
+- `"sync"`: `irecv` completes inline in `_prestep_sync_buffers`; simplest
+  and good for debugging.
+- `"async_recv"`: `irecv` is posted in `_prestep_sync_buffers`, but
+  `wait()` is deferred to `_complete_pending_recv`, which allows
+  compute/communication overlap.
+- `"fully_async"`: send and receive are both deferred for maximum
+  overlap. Pending sends from the prior step are drained at the start of
+  the next `_prestep_sync_buffers`.
 
 ### Pre-allocated buffers
 
@@ -240,9 +246,12 @@ stage = DemoDynamics(
 )
 ```
 
-Buffers are **lazily initialized** on the first step using the first concrete batch as a template for attribute keys, dtypes, and shapes. This means the first step has slightly more overhead.
+Buffers are **lazily initialized** on the first step using the first
+concrete batch as a template for attribute keys, dtypes, and shapes.
+This means the first step has slightly more overhead.
 
-Adjacent stages must use identical `BufferConfig` values — this is validated in `DistributedPipeline.setup()`.
+Adjacent stages must use identical `BufferConfig` values. This is
+validated in `DistributedPipeline.setup()`.
 
 ---
 
@@ -262,20 +271,27 @@ The dynamics framework manages data flow through three layers:
 
 Each pipeline step follows a four-phase protocol:
 
-1. `_prestep_sync_buffers()` — zeros send buffer, posts `irecv` from prior rank
-2. `_complete_pending_recv()` — waits on deferred recv, routes into active batch, drains overflow sinks
-3. `step()` — dynamics integration
-4. `_poststep_sync_buffers(converged_indices)` — extracts converged into send buffer, sends to next rank
+1. `_prestep_sync_buffers()` zeros the send buffer and posts `irecv`
+   from the prior rank.
+2. `_complete_pending_recv()` waits on the deferred receive, routes it
+   into the active batch, and drains overflow sinks.
+3. `step()` runs dynamics integration.
+4. `_poststep_sync_buffers(converged_indices)` extracts converged
+   samples into the send buffer and sends them to the next rank.
 
-**Deadlock prevention:** when no samples converge, an empty send buffer is still sent so the downstream `irecv` completes.
+**Deadlock prevention:** when no samples converge, an empty send buffer
+is still sent so the downstream `irecv` completes.
 
 ### Back-pressure
 
 When `send_buffer` has limited capacity (via `BufferConfig`):
 
 - Only `min(converged_count, remaining_capacity)` samples are extracted
-- Excess converged samples remain in the active batch as **no-ops** — their positions/velocities are saved before the integrator and restored after
-- Without `BufferConfig`, all converged samples are sent without constraints (backward compat)
+- Excess converged samples remain in the active batch as **no-ops**.
+  Their positions and velocities are saved before the integrator and
+  restored after it runs.
+- Without `BufferConfig`, all converged samples are sent without
+  constraints (backward compatible).
 
 ### Buffer lifecycle: put/defrag/zero
 
@@ -294,7 +310,9 @@ src_batch.defrag()
 buffer.zero()
 ```
 
-**Important:** `Batch.put()` uses Warp GPU kernels that only handle float32 attributes. Adjacent pipeline stages must have identical `BufferConfig` values.
+**Important:** `Batch.put()` uses Warp GPU kernels that only handle
+float32 attributes. Adjacent pipeline stages must have identical
+`BufferConfig` values.
 
 ### Data routing methods
 
@@ -348,7 +366,8 @@ When `refill_frequency` triggers (every N steps), `_refill_check()`:
 5. Appends replacements via `Batch.append`
 6. Rebuilds `status` (replacements get `0`) and `fmax` (replacements get `inf`) tensors
 
-This produces a **new** `Batch` object (not in-place mutation). Returns `None` when the sampler is exhausted and no active samples remain.
+This produces a **new** `Batch` object, not an in-place mutation. It
+returns `None` when the sampler is exhausted and no active samples remain.
 
 ### With FusedStage
 

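Note on the back-pressure bullets in the diff above: the `min(converged_count, remaining_capacity)` rule can be sketched as a small pure-Python function. This is an illustration under stated assumptions, not the framework's code. The function `split_converged` and its parameters are hypothetical, and choosing the first converged samples in order is an assumption; the source does not say which ones are picked.

```python
def split_converged(converged_indices, send_capacity=None, send_used=0):
    """Split converged samples into (to_send, held_back).

    send_capacity=None models the no-BufferConfig case, where all
    converged samples are sent without constraints.
    """
    if send_capacity is None:
        return list(converged_indices), []
    remaining = max(send_capacity - send_used, 0)
    n_send = min(len(converged_indices), remaining)
    # Held-back samples stay in the active batch as no-ops until space frees up.
    return list(converged_indices[:n_send]), list(converged_indices[n_send:])


print(split_converged([2, 5, 7, 9], send_capacity=3, send_used=1))  # ([2, 5], [7, 9])
print(split_converged([2, 5, 7, 9], send_capacity=3, send_used=3))  # ([], [2, 5, 7, 9])
print(split_converged([2, 5, 7, 9]))                                # ([2, 5, 7, 9], [])
```

When `to_send` comes back empty, the deadlock-prevention rule in the same hunk still applies: an empty send buffer is sent so that the downstream `irecv` completes.
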
diff --git a/.claude/skills/nvalchemi-dynamics-implementation/SKILL.md b/.claude/skills/nvalchemi-dynamics-implementation/SKILL.md
@@ -22,7 +22,7 @@ from nvalchemi.data import Batch
 
 Each call to `step(batch)` executes:
 
-```
+```text
 1. BEFORE_STEP hooks
 2. BEFORE_PRE_UPDATE hooks  →  pre_update(batch)  →  AFTER_PRE_UPDATE hooks
 3. BEFORE_COMPUTE hooks     →  compute(batch)      →  AFTER_COMPUTE hooks