
fix(doctor): strip conda paths from LD_LIBRARY_PATH + add torch-npu to report#3

Merged
moonandlife merged 2 commits into main from ws/npu-smi-preflight-fallback on Jun 16, 2026


moonandlife (Contributor) commented Jun 16, 2026

Root Cause

build_env_dict() preserved conda environment library paths in LD_LIBRARY_PATH in both of its branches:

  1. Active-vendor branch (lines 490-493): kept ALL current LD_LIBRARY_PATH entries, including conda paths such as /root/miniconda3/envs/vllm-hust-dev/lib
  2. Clean branch (lines 494-500): _sanitize_ld_path() removed only /Ascend/ paths, so conda paths survived

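Conda env lib/ directories ship their own builds of common system libraries (libstdc++.so.6, libgcc_s.so.1, libz.so.1). Because the dynamic loader searches LD_LIBRARY_PATH before the default system directories, these copies shadow the versions that the system and the CANN driver expect. This causes: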

  • npu-smi to crash (conda libstdc++ incompatible with CANN driver)
  • torch.npu.device_count() to silently return 0 (driver init fails with wrong libstdc++)
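To check whether a given environment is affected, the shadowing described above can be detected by scanning each LD_LIBRARY_PATH entry for the libraries named in this PR. The following is a minimal diagnostic sketch, not code from this repository; the function name `find_shadowing_entries` and the constant `SHADOW_CANDIDATES` are hypothetical.

```python
import os

# Library names taken from the root-cause description; extend as needed.
SHADOW_CANDIDATES = ("libstdc++.so.6", "libgcc_s.so.1", "libz.so.1")


def find_shadowing_entries(ld_library_path, names=SHADOW_CANDIDATES):
    """Map each LD_LIBRARY_PATH entry to the candidate libraries it contains.

    Hypothetical helper: entries that appear in the result would be searched
    before the default system directories and could shadow system copies.
    """
    hits = {}
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # skip empty entries in this sketch
        found = [n for n in names if os.path.exists(os.path.join(entry, n))]
        if found:
            hits[entry] = found
    return hits


if __name__ == "__main__":
    # Inspect the current process environment.
    print(find_shadowing_entries(os.environ.get("LD_LIBRARY_PATH", "")))
```

A conda env lib/ directory showing up in the result, ahead of the system or driver directories, would be consistent with the failure mode described here.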

Fix

Commit 1: Add _strip_conda_ld_paths()

Adds a new function that drops every LD_LIBRARY_PATH entry whose path contains a conda/mamba/anaconda/envs marker. It is applied in both branches of build_env_dict(). Conda env libraries are already found via rpath and conda activation, so they should never be in LD_LIBRARY_PATH.
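The filtering idea can be sketched as follows. This is an illustration under stated assumptions, not the merged `_strip_conda_ld_paths()`: the name `strip_conda_ld_paths_sketch`, the exact marker list, and the case-insensitive substring matching are all assumptions, and the real function may match differently.

```python
# Assumed markers, derived from the PR description ("conda/mamba/anaconda/envs").
# "anaconda" and "miniconda3" are covered by the "conda" substring.
CONDA_MARKERS = ("conda", "mamba", "/envs/")


def strip_conda_ld_paths_sketch(ld_library_path: str) -> str:
    """Return ld_library_path without entries that look like conda/mamba dirs.

    Hypothetical sketch: an entry is dropped when any marker occurs in its
    lowercased path. Order of the remaining entries is preserved.
    """
    kept = []
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # sketch choice: also drop empty entries
        # Append a trailing slash so "/envs/" also matches a path ending in "/envs".
        probe = entry.lower().rstrip("/") + "/"
        if any(marker in probe for marker in CONDA_MARKERS):
            continue
        kept.append(entry)
    return ":".join(kept)


# Example with the conda path from the root-cause section; the other two
# entries are illustrative.
cleaned = strip_conda_ld_paths_sketch(
    "/usr/local/Ascend/driver/lib64:"
    "/root/miniconda3/envs/vllm-hust-dev/lib:"
    "/usr/lib64"
)
print(cleaned)
```

Substring matching is deliberately broad, so it can also drop an unrelated directory whose name happens to contain one of the markers. Whether that trade-off is acceptable depends on the deployment.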

Commit 2: Include torch-npu version in report

Adds torch-npu to the packages dict in _runtime_report(), so the report shows the installed torch-npu version when the torch_npu probe fails.
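One way to collect such versions without importing the packages themselves is the standard library's `importlib.metadata`. The sketch below is an assumption about the general approach, not the body of `_runtime_report()`; the helper name `package_versions` is hypothetical.

```python
from importlib import metadata


def package_versions(names):
    """Return {distribution name: installed version, or None if not installed}.

    Hypothetical helper: reads installed distribution metadata only, so it
    still works when importing the package itself would fail.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Reporting torch and torch-npu side by side makes a version mismatch visible.
packages = package_versions(["torch", "torch-npu"])
print(packages)
```

Reading metadata rather than `torch_npu.__version__` keeps the report usable in exactly the case this commit targets, where the torch_npu probe itself does not succeed.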

Add torch-npu to the packages dict in _runtime_report so that the
runtime check JSON output includes the installed torch-npu version.
This helps diagnose torch/torch_npu version mismatch issues when the
torch_npu probe fails to detect NPU devices.

Signed-off-by: Shuhao Zhang <shuhao_zhang@hust.edu.cn>

build_env_dict() previously preserved all existing LD_LIBRARY_PATH
entries (in the active-vendor branch) or only removed /Ascend/ paths
(in the clean branch). This allowed conda environment library paths
like /root/miniconda3/envs/vllm-hust-dev/lib to leak into
LD_LIBRARY_PATH, where conda-built system libraries (libstdc++.so.6,
libgcc_s.so.1, libz.so.1) shadow the system/CANN driver versions.

This causes npu-smi, torch_npu device detection, and other Ascend
runtime components to malfunction because they load incompatible
library versions from the conda environment instead of the system.

Fix: add _strip_conda_ld_paths() that filters out paths containing
conda/mamba/anaconda/envs markers, and apply it in both branches of
build_env_dict(). Conda env libraries are already found via rpath and
the conda activation mechanism — they should never be in LD_LIBRARY_PATH.

Signed-off-by: Shuhao Zhang <shuhao_zhang@hust.edu.cn>
@moonandlife moonandlife changed the title from "feat(runtime): include torch-npu version in runtime report" to "fix(doctor): strip conda paths from LD_LIBRARY_PATH + add torch-npu to report" Jun 16, 2026
@moonandlife moonandlife merged commit 071d240 into main Jun 16, 2026
1 check passed
@moonandlife moonandlife deleted the ws/npu-smi-preflight-fallback branch June 16, 2026 02:30