Fix test_lru_cache_insert_large_grid associativity on ROCm wavefront64 by q10 · Pull Request #5899 · pytorch/FBGEMM

q10 · 2026-06-14T07:08:05Z

Summary:
test_lru_cache_insert_large_grid (added by D105282095) hardcodes the LXU
cache associativity as 32. The split-embeddings LXU cache is set-associative
with associativity == warp size == DEFAULT_ASSOC (32 on NVIDIA, 64 on AMD).

On AMD wavefront64 (gfx942 / MI300) lru_cache_insert_kernel strides cache
rows by kWarpSize = 64 and writes lxu_cache_state / lxu_cache_weights /
lru_state for slot in [0, 64), indexing past the 32-wide test
allocations -> out-of-bounds -> non-deterministic memory corruption ->
flaky assertEqual(lru_state != time_stamp, 0) failures in OSS ROCm CI
(see P2378242263). On NVIDIA (32 == 32) the allocation matches the kernel,
so the test passed.
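
The overrun can be reproduced with plain index arithmetic. The sketch below is an illustrative model, not the FBGEMM kernel: it assumes cache rows are addressed as `cache_set * assoc + slot`, and the helper names `max_flat_index` and `overruns` are hypothetical.

```python
# Illustrative model of the indexing only; this is not the FBGEMM kernel.
# Assumption: a cache row is addressed as cache_set * assoc + slot.

def max_flat_index(cache_sets: int, kernel_assoc: int) -> int:
    """Largest row index touched when the kernel strides by kernel_assoc."""
    last_set = cache_sets - 1
    last_slot = kernel_assoc - 1
    return last_set * kernel_assoc + last_slot


def overruns(cache_sets: int, alloc_assoc: int, kernel_assoc: int) -> bool:
    """True if the kernel indexes past a buffer sized with alloc_assoc."""
    allocated_rows = cache_sets * alloc_assoc
    return max_flat_index(cache_sets, kernel_assoc) >= allocated_rows


# NVIDIA: the test allocates 32-wide sets and the kernel strides by 32.
nvidia_overrun = overruns(cache_sets=4, alloc_assoc=32, kernel_assoc=32)

# AMD wavefront64: the test allocates 32-wide sets, the kernel strides by 64.
amd_overrun = overruns(cache_sets=4, alloc_assoc=32, kernel_assoc=64)
```

Under these assumptions `nvidia_overrun` is `False` and `amd_overrun` is `True`: with 4 sets the 32-wide allocation holds 128 rows, while a 64-wide stride reaches row 255.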

Fix (test-only; no kernel/production change):

- Size the three cache tensors and assertions by DEFAULT_ASSOC instead of
  the literal 32, matching the established pattern in lxu_cache_test.py
  and nbit_cache_test.py, so the allocation width matches the kernel's
  kWarpSize associativity on both platforms.
- Fix torch.accelerator.current_accelerator("cuda") -> current_accelerator()
  (the string was silently coerced to check_available=True; flagged by
  ai_diff_reviewer).
- Generalize the docstring's NVIDIA-specific (32) grid math.
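
A minimal sketch of the sizing change, using tensor shapes as plain tuples so it runs without a GPU. `default_assoc` is a hypothetical stand-in for `DEFAULT_ASSOC`, and the shape layout in `cache_shapes` is an assumption made for illustration, not copied from the test.

```python
# Sketch only: shapes are modeled as tuples, no tensors are allocated.

def default_assoc(is_rocm: bool) -> int:
    # Hypothetical stand-in for DEFAULT_ASSOC: 32 on NVIDIA, 64 on AMD.
    return 64 if is_rocm else 32


def cache_shapes(cache_sets: int, embedding_dim: int, assoc: int) -> dict:
    # Assumed layout: one state/timestamp entry per (set, slot),
    # and one weight row per slot across all sets.
    return {
        "lxu_cache_state": (cache_sets, assoc),
        "lxu_cache_weights": (cache_sets * assoc, embedding_dim),
        "lru_state": (cache_sets, assoc),
    }


# Before: width fixed at the literal 32 regardless of platform.
hardcoded = cache_shapes(cache_sets=4, embedding_dim=8, assoc=32)

# After: width follows the platform's associativity.
on_rocm = cache_shapes(cache_sets=4, embedding_dim=8, assoc=default_assoc(True))
on_cuda = cache_shapes(cache_sets=4, embedding_dim=8, assoc=default_assoc(False))
```

With the platform-derived width, the NVIDIA shapes are unchanged from the hardcoded version, while the ROCm shapes double in the associativity dimension and so cover every slot in `[0, 64)`.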

Differential Revision: D108540654


meta-codesync · 2026-06-14T07:08:15Z

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108540654.

Reviewed By: henrylhtsang

meta-codesync · 2026-06-15T17:56:23Z

This pull request has been merged in 5fd50d1.

pytorch-bot Bot added ciflow/rocm module: rocm labels Jun 14, 2026

meta-cla Bot added the cla signed label Jun 14, 2026

meta-codesync Bot added the meta-exported label Jun 14, 2026

meta-codesync Bot closed this in 5fd50d1 Jun 15, 2026

meta-codesync Bot added the Merged label Jun 15, 2026

q10 wants to merge 1 commit into pytorch:main from q10:export-D108540654
