Resolve ewald cache shape issue by ys-teh · Pull Request #110 · NVIDIA/nvalchemi-toolkit

ys-teh · 2026-06-09T21:17:06Z

ALCHEMI Toolkit Pull Request

Description

The Ewald/PME wrappers used torch.allclose(cell, self._cached_cell) to detect cell changes for cache invalidation. This assumes the current and cached cell tensors have identical shape, dtype, and device. When the same wrapper instance sees a different batch size, for example between training and validation batches, cell.shape changes from (B1, 3, 3) to (B2, 3, 3). If the two shapes cannot be broadcast together, torch.allclose raises instead of returning False, so the stale cache is never invalidated and the forward pass fails, interrupting model training.
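The failure described above can be reproduced in isolation. This is a minimal sketch, not code from the PR: the batch sizes (4 and 2), the identity cells, and the name `naive_cache_is_stale` are illustrative assumptions.

```python
import torch

# Hypothetical cached cell from a training batch of 4 systems: shape (4, 3, 3).
cached_cell = torch.eye(3).repeat(4, 1, 1)
# Hypothetical incoming cell from a validation batch of 2 systems: shape (2, 3, 3).
cell = torch.eye(3).repeat(2, 1, 1)


def naive_cache_is_stale(cell: torch.Tensor, cached_cell: torch.Tensor) -> bool:
    """Old-style check (assumed form): compare values directly with torch.allclose."""
    return not torch.allclose(cell, cached_cell)


try:
    naive_cache_is_stale(cell, cached_cell)
    raised = False
except RuntimeError as exc:
    # Leading dimensions 2 and 4 cannot be broadcast together,
    # so the comparison errors out instead of returning False.
    raised = True
    print(f"torch.allclose raised: {exc}")
```

Note that a batch size of 1 on either side would broadcast silently rather than raise, which is another reason an explicit shape check is safer than relying on the comparison itself.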

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Performance improvement
Documentation update
Refactoring (no functional changes)
CI/CD or infrastructure change

Related Issues

Changes Made

Implements a core cache invalidation fix for Ewald and PME cell caches.
- Adds _cell_cache_needs_update() to both EwaldModelWrapper and PMEModelWrapper.
- The helper treats a missing cache, a shape mismatch, a device mismatch, a dtype mismatch, or changed cell values as stale-cache conditions.
- Updates the forward paths to call this helper before recomputing Ewald/PME cache state.
- Adds regression tests for both wrappers covering a missing cached cell, reuse of an identical cached cell, a train/validation batch-size shape mismatch, same-shape changed cell values, and a dtype mismatch.
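The stale-cache conditions listed above could be checked roughly as follows. This is a sketch under stated assumptions, not the merged implementation: the signature and the default tolerances (rtol=1e-6, atol=1e-9, the values the review thread cites as the originals) are assumptions.

```python
from typing import Optional

import torch


def cell_cache_needs_update(
    cell: torch.Tensor,
    cached_cell: Optional[torch.Tensor],
    rtol: float = 1e-6,  # assumed default, not confirmed by the PR
    atol: float = 1e-9,  # assumed default, not confirmed by the PR
) -> bool:
    """Return True when the cached cell can no longer be reused."""
    if cached_cell is None:
        return True  # nothing cached yet
    if cell.shape != cached_cell.shape:
        return True  # e.g. train vs. validation batch size
    if cell.device != cached_cell.device:
        return True  # tensors moved between devices
    if cell.dtype != cached_cell.dtype:
        return True  # e.g. float32 vs. float64
    # Only now is it safe to compare values elementwise.
    return not torch.allclose(cell, cached_cell, rtol=rtol, atol=atol)


cell = torch.eye(3).repeat(2, 1, 1) * 10.0
print(cell_cache_needs_update(cell, None))                                 # missing cache
print(cell_cache_needs_update(cell, cell.clone()))                         # identical cell
print(cell_cache_needs_update(cell, torch.eye(3).repeat(4, 1, 1) * 10.0))  # shape mismatch
```

Checking the cheap metadata first means torch.allclose only ever sees tensors it can compare without raising.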

Testing

Unit tests pass locally (make pytest)
Linting passes (make lint)
New tests added for new functionality meet coverage expectations?

Checklist

I have read and understand the Contributing Guidelines
I have updated the CHANGELOG.md
I have performed a self-review of my code
I have added docstrings to new functions/classes
I have updated the documentation (if applicable)

Additional Notes

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

copy-pr-bot · 2026-06-09T21:17:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-09T21:19:29Z

Greptile Summary

This PR fixes a crash in the Ewald and PME wrappers where torch.allclose was called on cell tensors with mismatched shapes (e.g. different batch sizes between training and validation), causing an exception instead of cache invalidation. The fix extracts the comparison into a shared cell_cache_needs_update helper in _utils.py that explicitly guards shape, device, and dtype before delegating to torch.allclose.

- Core fix (_utils.py): New cell_cache_needs_update handles all mismatch conditions robustly, but its default tolerances (rtol=1e-5, atol≈1e-6) are 10–1000× looser than the originals (rtol=1e-6, atol=1e-9), which may cause stale-cache usage in fine-grained NPT simulations.
- Wrapper updates (ewald.py, pme.py): Both wrappers now delegate to the shared helper and expose rtol/atol as constructor parameters for user control.
- Tests (test_base.py): Five regression tests cover the main fix; the device-mismatch branch is not yet tested.
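To make the tolerance concern concrete, here is a small arithmetic sketch using the closeness criterion that torch.allclose documents, |input - other| <= atol + rtol * |other|. The box length and the size of the perturbation are made-up illustrative numbers, and atol=1e-6 stands in for the "≈1e-6" quoted above.

```python
def within_tolerance(new: float, cached: float, rtol: float, atol: float) -> bool:
    """Scalar version of the closeness test: |new - cached| <= atol + rtol * |cached|."""
    return abs(new - cached) <= atol + rtol * abs(cached)


cached_length = 10.0                # hypothetical box edge, arbitrary length units
new_length = cached_length + 5e-5   # hypothetical small box update during NPT

# Original tolerances: threshold is about 1.0e-5, so the 5e-5 change is detected.
tight = within_tolerance(new_length, cached_length, rtol=1e-6, atol=1e-9)
# Looser tolerances: threshold is about 1.0e-4, so the same change is ignored.
loose = within_tolerance(new_length, cached_length, rtol=1e-5, atol=1e-6)

print(f"tight tolerances treat the cell as unchanged: {tight}")  # False
print(f"loose tolerances treat the cell as unchanged: {loose}")  # True
```

Under the looser pair the changed cell counts as "close", so a cache keyed on it would be reused even though the box has moved.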

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvalchemi/models/_utils.py | Adds `cell_cache_needs_update` helper that guards against shape/device/dtype mismatches before calling `torch.allclose`; default tolerances are 10–1000× looser than the originals (`rtol=1e-6, atol=1e-9`), which could cause a stale cache to be reused in NPT simulations. |
| nvalchemi/models/ewald.py | Replaces inline `torch.allclose` call with `cell_cache_needs_update`; adds `rtol`/`atol` constructor parameters with correct wiring into forward path. |
| nvalchemi/models/pme.py | Mirrors the same cache-invalidation fix as `ewald.py`; changes are symmetric and correct. |
| test/models/test_base.py | Adds five unit tests for `cell_cache_needs_update` covering the key regression cases; device-mismatch path is not tested. |

Reviews (4): Last reviewed commit: "update tolerances and break up OR statem..."

laserkelvin

In general looks good to me, just have some minor things to discuss

laserkelvin · 2026-06-15T18:43:38Z



+def cell_cache_needs_update(
+    cell: torch.Tensor,


Do you mind adding the appropriate jaxtyping shape annotations?

You might need to have them as separate hints to denote that the cached and the incoming cells can be different shapes

Updated, thanks for the reminder.

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

ys-teh · 2026-06-15T22:13:06Z

Thanks, I addressed all the comments including allowing users to change the tolerances when initializing the wrappers. All related tests passed.

laserkelvin

LGTM

laserkelvin · 2026-06-15T22:19:44Z

/ok to test b0af7b4

ys-teh marked this pull request as ready for review June 9, 2026 21:17

greptile-apps Bot reviewed Jun 9, 2026


Comment thread nvalchemi/models/ewald.py Outdated

ys-teh requested a review from dallasfoster June 9, 2026 22:07

laserkelvin requested changes Jun 15, 2026


ys-teh added 3 commits June 15, 2026 21:09

resolve ewald cache shape issue

09815ac

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

move cell_cache_needs_update to utils

77d1479

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

add jaxtyping

347d601

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

ys-teh force-pushed the fix/ewald-pme-cache branch from 62734a6 to 347d601 Compare June 15, 2026 21:46

update tolerances and break up OR statement

b0af7b4

Signed-off-by: Ying Shi Teh <yteh@nvidia.com>

laserkelvin approved these changes Jun 15, 2026


laserkelvin added this pull request to the merge queue Jun 15, 2026

Merged via the queue into NVIDIA:main with commit 01c99d5 Jun 15, 2026
5 checks passed


laserkelvin merged 4 commits into NVIDIA:main from ys-teh:fix/ewald-pme-cache


2 participants
