fix bugs when device is None in get_full_tflops_approx and add b200 tflops #172

Open
WhatGhost wants to merge 2 commits into ByteDance-Seed:main from WhatGhost:dev-b200

@WhatGhost

Summary

This PR adds NVIDIA B200 support to the NVIDIA GEMM performance model and fixes the fallback TFLOPS estimation path for unknown GPUs.

Problem

When running the NVIDIA MoE installation test on B200, the test failed before executing the actual MoE kernel. The failure happened while estimating Tensor Core TFLOPS for logging/performance modeling.

B200 was not listed in get_tensorcore_tflops_by_device_name(), so the code fell back to the estimation path:

WARNING:root:device NVIDIA B200 not listed here. calculate tflops by estimation, or you can report it to developers.
...
TypeError: 'NoneType' object cannot be interpreted as an integer

The fallback path accepted device=None for torch.cuda.get_device_properties(), but then passed the same None into NVML via nvmlDeviceGetHandleByIndex(), which requires an integer device index.
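The failure mode described above can be reproduced without a GPU. The sketch below is an illustration only: it assumes the NVML Python binding converts the index to a C unsigned integer via ctypes before calling into the driver, and `to_nvml_index` and `fails_with_type_error` are hypothetical helpers, not functions from this repository.

```python
import ctypes


def to_nvml_index(index):
    """Hypothetical stand-in for the integer conversion an NVML binding performs."""
    # ctypes refuses to build a C unsigned int from None and raises TypeError.
    return ctypes.c_uint(index)


def fails_with_type_error(index) -> bool:
    """Return True if converting `index` raises TypeError, False otherwise."""
    try:
        to_nvml_index(index)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    print(fails_with_type_error(None))  # None is rejected
    print(fails_with_type_error(0))     # a real device index is accepted
```

Under that assumption, torch.cuda.get_device_properties(None) succeeds because PyTorch resolves None to the current device itself, while the NVML call receives the raw None and fails.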

Fix

  1. Add B200 TFLOPS and DRAM bandwidth to the list.
  2. Normalize device=None to torch.cuda.current_device() in the fallback TFLOPS estimation path, so future unknown GPUs do not crash when querying NVML.
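The normalization in the fix could look roughly like the sketch below. It is not the code from this PR: `resolve_device_index` is a hypothetical helper name, and the injectable `current_device` parameter is an assumption added so the logic can be exercised without a GPU.

```python
from typing import Callable, Optional


def resolve_device_index(
    device: Optional[int] = None,
    current_device: Optional[Callable[[], int]] = None,
) -> int:
    """Return a concrete integer device index (hypothetical helper).

    `device=None` is mapped to the current CUDA device so that downstream
    calls requiring an integer index never receive None.
    """
    if device is None:
        if current_device is None:
            # Imported lazily so the helper can be tested without PyTorch.
            import torch

            current_device = torch.cuda.current_device
        return int(current_device())
    return int(device)


if __name__ == "__main__":
    # Simulate a process whose current device is 3.
    print(resolve_device_index(None, current_device=lambda: 3))
    print(resolve_device_index(1))
```

The resolved index can then be passed both to torch.cuda.get_device_properties() and to nvmlDeviceGetHandleByIndex(). One caveat worth checking separately: NVML enumerates all physical GPUs, so when CUDA_VISIBLE_DEVICES is set the CUDA index and the NVML index may not refer to the same device.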

The tests passed successfully after the fix.
