[core] Avoid FabricManager stall on NVLink systems in GpuProfilingManager #63312

Open
aschuh-hf wants to merge 3 commits into ray-project:master from aschuh-hf:copilot/fix-gpu-profilling-manager

@aschuh-hf

Fixes #63243

On NVLink/NVSwitch nodes (e.g. A100 SXM4), a bare nvidia-smi call triggers a blocking RPC to the FabricManager daemon. If another process (e.g. dynolog) holds the NVML lock, the call stalls for 15–20 s, which is long enough to exceed the raylet's dashboard agent startup timeout and prevent ray.init() from succeeding.
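
To check whether a given node shows this stall, the two invocations can be timed side by side. The following is only a rough sketch: time_command is a hypothetical helper, not part of Ray, and the 30 s default timeout is an arbitrary value chosen to sit above the reported 15–20 s stall.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout_s=30.0):
    """Return wall-clock seconds taken by cmd.

    Returns None if the binary is missing, the command exits non-zero,
    or it does not finish within timeout_s. Hypothetical helper for
    illustration only.
    """
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # FileNotFoundError is an OSError; CalledProcessError and
        # TimeoutExpired are both SubprocessError subclasses.
        return None
    return time.perf_counter() - start
```

Calling time_command(["nvidia-smi"]) and time_command(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]) on an affected node should make the difference visible; on a machine without the NVIDIA driver both calls simply return None.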

Changes

  • node_has_gpus(): replace bare nvidia-smi with nvidia-smi --query-gpu=name --format=csv,noheader, which queries NVML directly (~0.1 s) without touching the FabricManager.
# Before
subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)

# After
subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    stderr=subprocess.DEVNULL,
)
  • enabled property: move binary-presence checks before node_has_gpus() so the GPU detection call is skipped entirely on nodes without dynolog installed.
# Before: node_has_gpus() always called first
return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None

# After: short-circuits before the GPU check when dynolog is absent
return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus()
  • Tests: added test_node_has_gpus_uses_query_gpu_flag (asserts exact nvidia-smi flags) and test_enabled_does_not_call_node_has_gpus_when_dynolog_missing (asserts short-circuit behavior).
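
The short-circuit behavior described in the second and third bullets can be illustrated in isolation. This is a minimal sketch with a hypothetical stand-in class (FakeGpuProfilingManager), not Ray's GpuProfilingManager and not the tests added in this PR; it copies only the reordered boolean expression and assumes the constructor does not call the GPU probe itself.

```python
from unittest import mock


class FakeGpuProfilingManager:
    """Hypothetical stand-in that mirrors only the reordered `enabled` check."""

    def __init__(self, dynolog_bin, dyno_bin):
        self._dynolog_bin = dynolog_bin
        self._dyno_bin = dyno_bin

    def node_has_gpus(self):
        # Placeholder for the real (potentially slow) nvidia-smi probe.
        raise AssertionError("probe should be mocked in this sketch")

    @property
    def enabled(self):
        # Cheap binary-presence checks come first, so `and` short-circuits
        # before the GPU probe when either binary is missing.
        return (
            self._dynolog_bin is not None
            and self._dyno_bin is not None
            and self.node_has_gpus()
        )


def probe_call_count(dynolog_bin, dyno_bin):
    """Return (enabled, number of node_has_gpus calls) for the given binaries."""
    manager = FakeGpuProfilingManager(dynolog_bin, dyno_bin)
    with mock.patch.object(
        FakeGpuProfilingManager, "node_has_gpus", return_value=True
    ) as probe:
        enabled = manager.enabled
        return enabled, probe.call_count
```

With both binaries missing, probe_call_count(None, None) returns (False, 0), meaning the probe never ran. With both present (the paths here are placeholders), probe_call_count("/usr/bin/dynolog", "/usr/bin/dyno") returns (True, 1).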

…led check

Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/3c45c1be-69c9-47ad-b79f-de26fcf1debc

Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
aschuh-hf requested a review from a team as a code owner on May 12, 2026 22:41

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes GPU detection in the GpuProfilingManager by reordering the enabled property to short-circuit the GPU check when required binaries are missing and updating the nvidia-smi command with specific query flags to avoid stalls. Review feedback suggests adding a timeout to the subprocess.check_output call for better robustness against system hangs. Additionally, it was noted that the class constructor might still trigger the GPU check unconditionally, which could interfere with the intended optimization and cause test failures.
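
The timeout suggestion from the review could look roughly like the sketch below. It is not the code in this PR: node_has_gpus_with_timeout, the 5 s default, and the injectable run parameter are assumptions made for illustration, and the sketch assumes that any failure (missing binary, non-zero exit, timeout) should be treated as "no GPUs".

```python
import subprocess


def node_has_gpus_with_timeout(timeout_s=5.0, run=subprocess.check_output):
    """Return True if the nvidia-smi query succeeds within timeout_s.

    Hypothetical variant for illustration; `run` is injectable only so
    the function can be exercised without a GPU.
    """
    try:
        run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hang
        )
        return True
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit, or timeout: report no GPUs.
        return False
```

Bounding the call this way keeps a hung nvidia-smi from blocking the caller indefinitely, at the cost of reporting no GPUs on a node where the query is merely slow.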

Comment thread python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Outdated
Comment thread python/ray/dashboard/modules/reporter/gpu_profile_manager.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9df1a13.

Comment thread python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Outdated
Copilot AI and others added 2 commits May 12, 2026 22:49
…node_has_gpus

Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/caeb8627-92ac-439e-a391-1e8432ce3f73

Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/13d3ab99-fcc1-4fd5-96de-e8766b8873c4

Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
ray-gardener (bot) added the core, observability, and community-contribution labels on May 13, 2026
Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

Development

Successfully merging this pull request may close these issues.

[core] GpuProfilingManager: bare 'nvidia-smi' call stalls 15-20s on NVLink systems, causing raylet startup timeout

2 participants