[core] Avoid FabricManager stall on NVLink systems in GpuProfilingManager #63312
aschuh-hf wants to merge 3 commits
…led check Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/3c45c1be-69c9-47ad-b79f-de26fcf1debc Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
Code Review
This pull request optimizes GPU detection in the GpuProfilingManager by reordering the enabled property to short-circuit the GPU check when required binaries are missing and updating the nvidia-smi command with specific query flags to avoid stalls. Review feedback suggests adding a timeout to the subprocess.check_output call for better robustness against system hangs. Additionally, it was noted that the class constructor might still trigger the GPU check unconditionally, which could interfere with the intended optimization and cause test failures.
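The timeout suggestion in this review could look roughly like the sketch below. It is not code from the PR: the helper name `run_nvidia_smi_query` and the 5-second default are assumptions chosen for illustration. The `timeout` argument and the `subprocess.TimeoutExpired` exception are standard parts of Python's `subprocess` module, and the `nvidia-smi` flags are the ones quoted in the PR description.

```python
import subprocess
from typing import Optional


def run_nvidia_smi_query(timeout_s: float = 5.0) -> Optional[str]:
    """Hypothetical helper: return nvidia-smi's GPU name listing, or None on failure.

    The 5-second default is an assumed value, not one taken from the PR.
    """
    cmd = ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
    try:
        raw = subprocess.check_output(
            cmd,
            timeout=timeout_s,  # give up instead of blocking if the call hangs
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time; treat the result as unknown.
        return None
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or exited with a non-zero status.
        return None
    return raw.decode()
```

Returning `None` on a timeout lets the caller treat a hung `nvidia-smi` the same way as a node with no usable GPUs, so startup is not blocked.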
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9df1a13.
…node_has_gpus Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/caeb8627-92ac-439e-a391-1e8432ce3f73 Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/13d3ab99-fcc1-4fd5-96de-e8766b8873c4 Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>

Fixes #63243
On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare `nvidia-smi` triggers a blocking RPC to the FabricManager daemon. If another process (e.g. `dynolog`) holds the NVML lock, this stalls 15–20 s, long enough to exceed the raylet's dashboard agent startup timeout and prevent `ray.init()` from succeeding.

Changes
- `node_has_gpus()`: replace bare `nvidia-smi` with `nvidia-smi --query-gpu=name --format=csv,noheader`, which queries NVML directly (~0.1 s) without touching the FabricManager.
- `enabled` property: move binary-presence checks before `node_has_gpus()` so the GPU detection call is skipped entirely on nodes without `dynolog` installed.
- Tests: `test_node_has_gpus_uses_query_gpu_flag` (asserts exact nvidia-smi flags) and `test_enabled_does_not_call_node_has_gpus_when_dynolog_missing` (asserts short-circuit behavior).
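The two behavioral changes can be sketched as follows. This is a minimal illustration, not the code from the PR: the class name `GpuProfilingManagerSketch`, the `_dynolog_bin` attribute, and the use of `shutil.which` to detect the binary are assumptions. Only the `nvidia-smi` flags and the ordering of the checks come from the description above.

```python
import shutil
import subprocess


class GpuProfilingManagerSketch:
    """Hypothetical stand-in for GpuProfilingManager; names are illustrative."""

    # Flags from the PR description: a targeted query instead of bare nvidia-smi.
    NVIDIA_SMI_CMD = ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]

    def __init__(self):
        # Assumption: binary presence is determined with shutil.which.
        self._dynolog_bin = shutil.which("dynolog")

    @classmethod
    def node_has_gpus(cls) -> bool:
        """Return True if nvidia-smi lists at least one GPU."""
        try:
            output = subprocess.check_output(
                cls.NVIDIA_SMI_CMD, stderr=subprocess.DEVNULL
            )
        except (OSError, subprocess.CalledProcessError):
            # nvidia-smi is missing or failed: treat the node as having no GPUs.
            return False
        return bool(output.strip())

    @property
    def enabled(self) -> bool:
        # The cheap binary-presence check comes first; `and` short-circuits,
        # so node_has_gpus() never runs when dynolog is not installed.
        return self._dynolog_bin is not None and self.node_has_gpus()
```

Because the binary check is the left operand of `and`, a node without `dynolog` never spawns `nvidia-smi` at all, which is the short-circuit behavior the second test asserts.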