Fix test_lru_cache_insert_large_grid associativity on ROCm wavefront64 #5899

Closed
q10 wants to merge 1 commit into pytorch:main from q10:export-D108540654

q10 (Contributor) commented Jun 14, 2026

Summary:
test_lru_cache_insert_large_grid (added by D105282095) hardcodes the LXU
cache associativity as 32. The split-embeddings LXU cache is set-associative
with associativity == warp size == DEFAULT_ASSOC (32 on NVIDIA, 64 on AMD).

On AMD wavefront64 (gfx942 / MI300), lru_cache_insert_kernel strides cache
rows by kWarpSize = 64 and writes lxu_cache_state / lxu_cache_weights /
lru_state for slot in [0, 64), indexing past the 32-wide test allocations.
These out-of-bounds writes corrupt memory non-deterministically, which shows
up as flaky assertEqual(lru_state != time_stamp, 0) failures in OSS ROCm CI
(see P2378242263). On NVIDIA, kWarpSize is 32 and matches the 32-wide
allocations, so the test passed.
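
As a rough illustration of the overrun described above, the following
pure-Python model counts writes that would land outside a test buffer. It is
a simplified sketch, not the kernel's actual indexing: the function name
out_of_bounds_writes and the flat-offset formula
cache_set * kernel_assoc + slot are assumptions made for illustration.

```python
def out_of_bounds_writes(num_sets: int, alloc_assoc: int, kernel_assoc: int) -> int:
    """Count modelled writes that fall outside a (num_sets, alloc_assoc) buffer.

    alloc_assoc  -- associativity the test used when allocating the tensor
    kernel_assoc -- associativity the kernel assumes (kWarpSize)
    """
    buffer_len = num_sets * alloc_assoc  # elements actually allocated
    oob = 0
    for cache_set in range(num_sets):
        for slot in range(kernel_assoc):
            # Assumed model: the kernel treats every set as kernel_assoc wide.
            flat_offset = cache_set * kernel_assoc + slot
            if flat_offset >= buffer_len:
                oob += 1
    return oob


# AMD wavefront64 against a 32-wide allocation: half the writes overrun.
amd_mismatch = out_of_bounds_writes(num_sets=4, alloc_assoc=32, kernel_assoc=64)
# NVIDIA, where allocation and kernel agree: no overrun.
nvidia_match = out_of_bounds_writes(num_sets=4, alloc_assoc=32, kernel_assoc=32)
```

Under this model, sizing the allocation with the same associativity the
kernel uses brings the overrun count to zero on either platform.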

Fix (test-only; no kernel/production change):

  • Size the three cache tensors and assertions by DEFAULT_ASSOC instead of
    the literal 32, matching the established pattern in lxu_cache_test.py
    and nbit_cache_test.py, so the allocation width matches the kernel's
    kWarpSize associativity on both platforms.
  • Fix torch.accelerator.current_accelerator("cuda") -> current_accelerator()
    (the string was silently coerced to check_available=True; flagged by
    ai_diff_reviewer).
  • Generalize the docstring's NVIDIA-specific (32) grid math.
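
The first bullet can be sketched as follows. This is a minimal illustration
under stated assumptions, not the code in the diff: default_assoc is a
hypothetical stand-in for DEFAULT_ASSOC (the description states only that it
is 32 on NVIDIA and 64 on AMD), cache_shapes is a hypothetical helper, and
the tensor shapes shown are assumed rather than taken from the test.

```python
from typing import Dict, Optional, Tuple


def default_assoc(hip_version: Optional[str]) -> int:
    """Hypothetical stand-in for DEFAULT_ASSOC: 32 on NVIDIA, 64 on AMD.

    hip_version is None on a non-ROCm build (an assumption of this sketch).
    """
    return 32 if hip_version is None else 64


def cache_shapes(num_sets: int, embedding_dim: int, assoc: int) -> Dict[str, Tuple[int, int]]:
    """Shapes for the three cache tensors, all derived from one associativity.

    The layouts below are assumed for illustration only.
    """
    return {
        "lxu_cache_state": (num_sets, assoc),
        "lxu_cache_weights": (num_sets * assoc, embedding_dim),
        "lru_state": (num_sets, assoc),
    }


# Deriving every width from the same value keeps the allocation in step
# with the kernel, instead of repeating the literal 32 in each place.
nvidia_shapes = cache_shapes(num_sets=4, embedding_dim=8, assoc=default_assoc(None))
amd_shapes = cache_shapes(num_sets=4, embedding_dim=8, assoc=default_assoc("6.2"))
```

Assertions written against the same associativity value stay valid on both
platforms without further changes.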

Differential Revision: D108540654

meta-codesync Bot (Contributor) commented Jun 14, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108540654.

q10 added a commit to q10/FBGEMM that referenced this pull request Jun 15, 2026
pytorch#5899)

Reviewed By: henrylhtsang

Differential Revision: D108540654
meta-codesync Bot closed this in 5fd50d1 Jun 15, 2026

meta-codesync Bot (Contributor) commented Jun 15, 2026

This pull request has been merged in 5fd50d1.

meta-codesync Bot added the Merged label Jun 15, 2026