fix(neighbors): delegate batched method selection to ops by nikitafedik · Pull Request #106 · NVIDIA/nvalchemi-toolkit

nikitafedik · 2026-06-06T02:39:00Z

ALCHEMI Toolkit Pull Request

Description

Align Toolkit neighbor-list construction with current Toolkit-Ops dispatch:
compute_neighbors and NeighborListHook now pass batched metadata
(batch_idx / batch_ptr) without forcing an explicit neighbor-list method.
That lets Toolkit-Ops choose the correct batched strategy via method=None.

This pairs with an upstream Toolkit-Ops guard that rejects explicit
single-system methods such as method="naive" or method="cell_list" when
batched metadata is provided.
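
For readers who have not seen the upstream change, the guard could look roughly like the sketch below. This is an illustration only, not the Toolkit-Ops implementation: `resolve_method`, the `SINGLE_SYSTEM_METHODS` set, and the placeholder choice made for `method=None` are hypothetical. Only the method names come from this PR.

```python
from typing import Optional

# Method names taken from this PR; grouping them in a set is an assumption.
SINGLE_SYSTEM_METHODS = {"naive", "cell_list"}


def resolve_method(method: Optional[str], has_batch_metadata: bool) -> str:
    """Hypothetical stand-in for the dispatch guard; not the Toolkit-Ops API."""
    if method is None:
        # Placeholder choice: the real dispatcher picks a strategy from the
        # geometry, which this sketch does not model.
        return "batch_naive" if has_batch_metadata else "naive"
    if has_batch_metadata and method in SINGLE_SYSTEM_METHODS:
        # Fail loudly instead of treating the concatenated batch as one system.
        raise ValueError(
            f"method={method!r} is a single-system method; "
            "use a batched method or method=None with batch metadata"
        )
    return method
```

The point of the design is that the caller never needs to know which batched strategy is appropriate: it passes the batch metadata and either no method or an explicitly batched one.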

The observed failure mode is that forcing method="naive" can connect atoms
from different batched systems as neighbors. The model then treats those
cross-system edges like real neighbor interactions/messages.
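
The failure mode can be illustrated without Toolkit-Ops at all. The sketch below is a pure-Python all-pairs search over four identical water-like systems packed into one coordinate list, using the geometry and cutoff from the reproduction script in this thread. `count_edges` and its `respect_batch` flag are hypothetical names for this illustration and do not mirror the library's kernels.

```python
import math
from typing import Tuple

# One water-like geometry, copied from the reproduction script in this thread.
WATER = [(0.0, 0.0, 0.0), (0.968565, 0.0, 0.0), (-0.242, 0.928, 0.0)]


def count_edges(n_systems: int, cutoff: float, respect_batch: bool) -> Tuple[int, int]:
    """Return (total_edges, cross_edges) for an all-pairs neighbor search."""
    positions = WATER * n_systems
    batch_idx = [s for s in range(n_systems) for _ in WATER]
    total = 0
    cross = 0
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i == j:
                continue
            same_system = batch_idx[i] == batch_idx[j]
            # A batched search skips pairs from different systems entirely.
            if respect_batch and not same_system:
                continue
            if math.dist(pi, pj) < cutoff:
                total += 1
                if not same_system:
                    cross += 1
    return total, cross


print(count_edges(4, 1.2, respect_batch=True))   # (16, 0)
print(count_edges(4, 1.2, respect_batch=False))  # (100, 84)
```

Ignoring the batch index yields 100 edges, 84 of which cross system boundaries. Respecting it yields the 16 intra-molecular edges.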

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Performance improvement
Documentation update
Refactoring (no functional changes)
CI/CD or infrastructure change

Related Issues

Relates to neighbor-list batching reports where explicit unbatched Toolkit-Ops
methods can create cross-graph neighbor edges.

Changes Made

Removed Toolkit's internal explicit batched-method selector.
Updated compute_neighbors and NeighborListHook to leave method=None when calling Toolkit-Ops with batched metadata.
Removed pre-allocation/forwarding of algorithm-specific scratch kwargs from NeighborListHook; Toolkit-Ops chooses among geometry-dependent strategies at dispatch time.
Added graph-boundary regression coverage for compute_neighbors and strengthened NeighborListHook boundary assertions.
Added tests that verify Toolkit delegates method selection instead of passing stale explicit methods or scratch kwargs.
Updated CHANGELOG.md.

Testing

Unit tests pass locally (make pytest)
Linting passes (make lint)
New tests added for new functionality meet coverage expectations?

Ran locally:

WARP_CACHE_PATH=/tmp/warp-cache \
TRITON_CACHE_DIR=/tmp/triton-cache \
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor-cache \
TORCH_COMPILE_DISABLE=1 \
python -m pytest test/hooks/test_neighbor_list_hook.py -q

Result: 123 passed, 16 warnings.

ruff check nvalchemi/neighbors.py nvalchemi/hooks/neighbor_list.py test/hooks/test_neighbor_list_hook.py

Result: All checks passed!

Coverage added in test/hooks/test_neighbor_list_hook.py:

test_compute_neighbors_multi_graph_isolation verifies one-shot neighbor
construction does not create cross-graph neighbor entries.
test_multi_graph_isolation now checks every valid NeighborListHook
neighbor entry for src_system == dst_system.
test_compute_neighbors_delegates_method_selection verifies
compute_neighbors leaves method=None.
TestAllocNlKwargs verifies NeighborListHook does not forward stale
algorithm-specific scratch kwargs while Toolkit-Ops owns method selection.

Here "hooks" refers to Toolkit runtime hooks, not Git or CI hooks. CI exercises
them by running pytest; the tests instantiate and call NeighborListHook
directly.
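
The delegation checks follow a recording-fake pattern, sketched below in plain Python. This is a simplified assumption-laden illustration, not the test code in this PR: `compute_neighbors_sketch`, `record_delegation`, and `ALLOWED_KEYS` are hypothetical, and the real call forwards more arguments than shown.

```python
from typing import Any, Dict, List

# Keys this sketch treats as legitimate; the real call forwards more arguments.
ALLOWED_KEYS = {"positions", "cutoff", "batch_idx", "batch_ptr", "method"}


def compute_neighbors_sketch(neighbor_list_fn, positions, batch_idx, batch_ptr, cutoff):
    """Hypothetical caller that forwards batch metadata and leaves method=None."""
    neighbor_list_fn(
        positions=positions,
        cutoff=cutoff,
        batch_idx=batch_idx,
        batch_ptr=batch_ptr,
        method=None,
    )


def record_delegation() -> List[Dict[str, Any]]:
    """Run the caller against a recording fake and return the captured kwargs."""
    calls: List[Dict[str, Any]] = []

    def fake_neighbor_list(**kwargs):
        calls.append(kwargs)

    compute_neighbors_sketch(
        fake_neighbor_list,
        positions=[(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
        batch_idx=[0, 1],
        batch_ptr=[0, 1, 2],
        cutoff=1.2,
    )
    return calls


calls = record_delegation()
assert [c.get("method") for c in calls] == [None]
assert set(calls[0]) <= ALLOWED_KEYS  # no algorithm-specific scratch kwargs
```

In the actual tests the fake is installed with pytest's `monkeypatch` fixture rather than passed in as an argument.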

Live H2O boundary probe against the patched Toolkit branch:

Toolkit compute_neighbors    total_edges= 16 cross_edges=  0 examples=[]
Toolkit NeighborListHook     total_edges= 16 cross_edges=  0 examples=[]
ops method=batch_naive       total_edges= 16 cross_edges=  0 examples=[]
ops method=naive             raises ValueError with upstream Toolkit-Ops guard
ops method=batch_cell_list   total_edges= 16 cross_edges=  0 examples=[]
ops method=cell_list         raises ValueError with upstream Toolkit-Ops guard
ops method=None              total_edges= 16 cross_edges=  0 examples=[]

Checklist

I have read and understand the Contributing Guidelines
I have updated the CHANGELOG.md
I have performed a self-review of my code
I have added docstrings to new functions/classes
I have updated the documentation (if applicable)

Additional Notes

This PR intentionally does not change the Toolkit public API. It updates
Toolkit's own neighbor-list callers to use Toolkit-Ops' official batched
auto-dispatch path.

The paired upstream Toolkit-Ops PR should land first or alongside this one so
direct explicit misuse fails loudly instead of silently treating a concatenated
batch as one system.

The current CONTRIBUTING.md says public direct contributions are not accepted
during the initial public beta, and signed-off commits are required.

copy-pr-bot · 2026-06-06T02:39:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-06T02:42:14Z

Greptile Summary

Fixes a correctness bug where NeighborListHook and compute_neighbors pre-allocated algorithm-specific scratch buffers (shift-range tensors, cell-list arrays) and inadvertently steered Toolkit-Ops toward single-system algorithms that could connect atoms across batch-graph boundaries. The fix is to stop passing those scratch kwargs entirely and let Toolkit-Ops auto-select the correct batched strategy via method=None.

_alloc_nl_kwargs is reduced to self._buf_nl_kwargs = {}, and the cell-list / naive-PBC scratch allocation logic is fully removed. _alloc_staging_buffers no longer accepts or uses batch_ptr since it was only needed by the old kwarg pre-allocation.
Tests in TestAllocNlKwargs are rewritten: the old assertions that expected specific pre-computed keys are replaced by assertions that _buf_nl_kwargs is empty; a new monkeypatched test verifies no stale scratch keys are forwarded.
_assert_no_cross_graph_neighbors is extracted as a shared helper and used to strengthen test_multi_graph_isolation, plus two new tests exercising compute_neighbors directly.
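
As a rough idea of what a boundary helper of this kind checks, here is a minimal sketch over plain Python lists. It is not the helper from the PR: the signature is hypothetical, and the padded neighbor-matrix layout (unused slots filled with the atom count) is assumed from the reproduction script in this thread.

```python
from typing import List


def assert_no_cross_graph_neighbors(
    neighbor_matrix: List[List[int]],
    num_neighbors: List[int],
    batch_idx: List[int],
) -> None:
    """Check that every valid (non-padding) entry stays inside its own graph."""
    for src, row in enumerate(neighbor_matrix):
        # Only the first num_neighbors[src] entries of a row are valid.
        for dst in row[: num_neighbors[src]]:
            assert batch_idx[src] == batch_idx[dst], (
                f"atom {src} (graph {batch_idx[src]}) lists atom {dst} "
                f"(graph {batch_idx[dst]}) as a neighbor"
            )


# Two graphs of two atoms each; 4 is the padding value (the atom count).
batch_idx = [0, 0, 1, 1]
good = [[1, 4], [0, 4], [3, 4], [2, 4]]
assert_no_cross_graph_neighbors(good, [1, 1, 1, 1], batch_idx)
```

Slicing each row by its neighbor count matters: padding entries point past the last atom and must never be used to index `batch_idx`.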

Important Files Changed

Filename	Overview
nvalchemi/hooks/neighbor_list.py	Removed algorithm-specific kwarg pre-allocation from _alloc_nl_kwargs; _alloc_staging_buffers signature simplified (batch_ptr removed); neighbor_list calls now pass no explicit method, delegating selection to Toolkit-Ops
test/hooks/test_neighbor_list_hook.py	TestAllocNlKwargs updated to verify empty kwargs; added _assert_no_cross_graph_neighbors helper; added multi-graph isolation and method-delegation tests for both NeighborListHook and compute_neighbors
CHANGELOG.md	Prepended entry for batched neighbor-list dispatch alignment fix

Reviews (3): Last reviewed commit: "fix(neighbors): delegate batched method ..."

greptile-apps · 2026-06-06T02:42:18Z

+    def test_compute_neighbors_multi_graph_isolation(self, device: str):
+        """compute_neighbors must not build neighbors across Batch graph boundaries."""
+        from nvalchemi.neighbors import compute_neighbors
+
+        batch = _line_batch(device, n_graphs=4)
+        compute_neighbors(batch, cutoff=_CUTOFF, max_neighbors=16)
+
+        _assert_no_cross_graph_neighbors(batch)
+
+    def test_compute_neighbors_passes_explicit_batched_method(
+        self, device: str, monkeypatch: pytest.MonkeyPatch
+    ):
+        """Toolkit should not rely on implicit Toolkit-Ops method selection."""
+        from nvalchemi import neighbors as neighbors_mod
+        from nvalchemi.neighbors import compute_neighbors
+
+        methods: list[str | None] = []
+
+        def fake_neighbor_list(**kwargs):
+            methods.append(kwargs.get("method"))
+            kwargs["num_neighbors"].zero_()
+
+        monkeypatch.setattr(neighbors_mod, "neighbor_list", fake_neighbor_list)
+
+        batch = _line_batch(device, n_graphs=4)
+        compute_neighbors(batch, cutoff=_CUTOFF, max_neighbors=16)
+
+        assert methods == ["batch_naive"]


New compute_neighbors tests placed in wrong class

test_compute_neighbors_multi_graph_isolation and test_compute_neighbors_passes_explicit_batched_method are appended to TestAdaptiveK, which is focused on neighbor-count overflow and shrinkage behaviour. Both tests cover graph-boundary isolation and explicit method dispatch — neither exercises the adaptive-K machinery. These would be easier to discover in a dedicated class, e.g. TestComputeNeighbors.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-06T02:42:19Z

            cutoff=cutoff,
            cell=cell,
            pbc=pbc,
+            method=_select_batched_neighbor_list_method(N, batch.num_graphs),


Method string recomputed on every overflow retry

_select_batched_neighbor_list_method(N, batch.num_graphs) is evaluated on each loop iteration, even though N and batch.num_graphs are invariant inside the while True block. Computing it once before the loop would make the intent clearer. Not a correctness issue.
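
A minimal sketch of the hoisting this comment suggests, assuming a retry loop that grows the neighbor capacity on overflow. `build_with_retry`, both callables, and the doubling policy are hypothetical and are not taken from the Toolkit source.

```python
from typing import Callable, Tuple


def build_with_retry(
    select_method: Callable[[int, int], str],
    run_neighbor_list: Callable[[str, int], bool],
    n_atoms: int,
    n_graphs: int,
    max_neighbors: int,
) -> Tuple[str, int]:
    """Hypothetical retry loop; run_neighbor_list returns True on overflow."""
    # n_atoms and n_graphs do not change between retries, so select once.
    method = select_method(n_atoms, n_graphs)
    while True:
        if not run_neighbor_list(method, max_neighbors):
            return method, max_neighbors
        max_neighbors *= 2  # growth policy is an assumption of this sketch
```

The comment applies to the revision under review at the time; the later revision described in this PR removes the selector and leaves method=None instead.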

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

nikitafedik · 2026-06-06T02:46:32Z

Reproduced against nvalchemi-toolkit-ops main.

Ops main checkout:

803008406345cf9d8b5c48a5afc2653d0f29d2bf

I put the ops-main source checkout first on PYTHONPATH and ran the standalone script below. The forced unbatched methods still cross system boundaries; automatic dispatch and explicit batched methods do not.

Output:

nvalchemiops: /home/nfedik/projects/toolkit/batching-bug/runs/nvalchemi-toolkit-ops-main/nvalchemiops/__init__.py
torch device: cuda

method           total_edges cross_edges examples
--------------------------------------------------------------------------------
batch_naive               16           0 []
naive                    100          84 ['0->3 (system 0->1, O0->O0)', '0->4 (system 0->1, O0->H1)', '0->5 (system 0->1, O0->H2)', '0->6 (system 0->2, O0->O0)', '0->7 (system 0->2, O0->H1)']
batch_cell_list           16           0 []
cell_list                100          84 ['0->3 (system 0->1, O0->O0)', '0->4 (system 0->1, O0->H1)', '0->5 (system 0->1, O0->H2)', '0->6 (system 0->2, O0->O0)', '0->7 (system 0->2, O0->H1)']
None                      16           0 []

Script:

#!/usr/bin/env python3
"""Reproduce cross-system neighbor edges with forced unbatched NL methods.

Run against Toolkit-Ops main by putting the source checkout first on PYTHONPATH,
for example:

    WARP_CACHE_PATH=/tmp/warp-cache-batching-bug \
    PYTHONPATH=/path/to/nvalchemi-toolkit-ops-main \
    python repro_ops_main_neighbor_boundaries.py
"""

from __future__ import annotations

import os

os.environ.setdefault("WARP_CACHE_PATH", "/tmp/warp-cache-batching-bug")
os.environ.setdefault("XDG_CACHE_HOME", "/tmp/torch-cache-batching-bug")
os.environ.setdefault("TRITON_CACHE_DIR", "/tmp/triton-cache-batching-bug")
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/tmp/torchinductor-cache-batching-bug")
for cache_dir in (
    os.environ["WARP_CACHE_PATH"],
    os.environ["XDG_CACHE_HOME"],
    os.environ["TRITON_CACHE_DIR"],
    os.environ["TORCHINDUCTOR_CACHE_DIR"],
):
    os.makedirs(cache_dir, exist_ok=True)

import torch
import nvalchemiops
from nvalchemiops.torch.neighbors import neighbor_list


def build_batched_waters(
    n_systems: int = 4,
    device: torch.device | str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, list[str]]:
    """Return identical water-like systems packed into one coordinate array."""
    one_water = torch.tensor(
        [
            [0.0, 0.0, 0.0],
            [0.968565, 0.0, 0.0],
            [-0.242, 0.928, 0.0],
        ],
        dtype=torch.float32,
        device=device,
    )
    positions = one_water.repeat(n_systems, 1)
    batch_idx = torch.repeat_interleave(
        torch.arange(n_systems, dtype=torch.int32, device=device), 3
    )
    batch_ptr = torch.arange(
        0, 3 * n_systems + 1, 3, dtype=torch.int32, device=device
    )
    atom_names = ["O0", "H1", "H2"] * n_systems
    return positions, batch_idx, batch_ptr, atom_names


def count_cross_edges(
    neighbor_matrix: torch.Tensor,
    num_neighbors: torch.Tensor,
    batch_idx: torch.Tensor,
    atom_names: list[str],
) -> tuple[int, int, list[str]]:
    """Count valid neighbor entries and those whose endpoints lie in different systems."""
    total_edges = 0
    cross_edges = 0
    examples: list[str] = []

    for src in range(neighbor_matrix.shape[0]):
        src_system = int(batch_idx[src].item())
        for dst in neighbor_matrix[src, : int(num_neighbors[src].item())].tolist():
            dst = int(dst)
            total_edges += 1
            dst_system = int(batch_idx[dst].item())
            if src_system != dst_system:
                cross_edges += 1
                if len(examples) < 5:
                    examples.append(
                        f"{src}->{dst} "
                        f"(system {src_system}->{dst_system}, "
                        f"{atom_names[src]}->{atom_names[dst]})"
                    )

    return total_edges, cross_edges, examples


def run_case(method: str | None) -> tuple[str, int, int, list[str]]:
    """Build the neighbor list with the given method and summarize its edges."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    positions, batch_idx, batch_ptr, atom_names = build_batched_waters(device=device)
    n_atoms = positions.shape[0]
    neighbor_matrix = torch.full(
        (n_atoms, 32), n_atoms, dtype=torch.int32, device=device
    )
    num_neighbors = torch.zeros(n_atoms, dtype=torch.int32, device=device)

    neighbor_list(
        positions=positions,
        cutoff=1.2,
        batch_idx=batch_idx,
        batch_ptr=batch_ptr,
        max_neighbors=32,
        neighbor_matrix=neighbor_matrix,
        num_neighbors=num_neighbors,
        method=method,
    )
    total_edges, cross_edges, examples = count_cross_edges(
        neighbor_matrix.cpu(),
        num_neighbors.cpu(),
        batch_idx.cpu(),
        atom_names,
    )
    return str(method), total_edges, cross_edges, examples


def main() -> None:
    print(f"nvalchemiops: {nvalchemiops.__file__}")
    print(f"torch device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
    print()
    print(f"{'method':16s} {'total_edges':>11s} {'cross_edges':>11s} examples")
    print("-" * 80)

    for method in ("batch_naive", "naive", "batch_cell_list", "cell_list", None):
        label, total_edges, cross_edges, examples = run_case(method)
        print(f"{label:16s} {total_edges:11d} {cross_edges:11d} {examples}")


if __name__ == "__main__":
    main()

Signed-off-by: Nikita Fedik <nfedik@nvidia.com>

nikitafedik · 2026-06-08T21:40:52Z

Updated after bot review: removed the unused _alloc_nl_kwargs arguments and the stale batch_ptr/plumbing comment in _alloc_staging_buffers. Re-ran ruff, diff-check, focused allocation tests, and the full test/hooks/test_neighbor_list_hook.py file: 123 passed.

greptile-apps Bot reviewed Jun 6, 2026

nikitafedik force-pushed the fix/batched-neighbor-method-guard branch from 24e312b to af576e7 Compare June 8, 2026 21:31

nikitafedik changed the title ~~fix(neighbors): select batched methods explicitly~~ fix(neighbors): delegate batched method selection to ops Jun 8, 2026

fix(neighbors): delegate batched method selection to ops

d3480a3

Signed-off-by: Nikita Fedik <nfedik@nvidia.com>

nikitafedik force-pushed the fix/batched-neighbor-method-guard branch from af576e7 to d3480a3 Compare June 8, 2026 21:40

nikitafedik marked this pull request as draft June 8, 2026 22:57

nikitafedik wants to merge 1 commit into NVIDIA:main from nikitafedik:fix/batched-neighbor-method-guard
