fix(doctor): strip conda paths from LD_LIBRARY_PATH + add torch-npu to report #3
Merged
Add torch-npu to the packages dict in _runtime_report so that the runtime check JSON output includes the installed torch-npu version. This helps diagnose torch/torch_npu version mismatch issues when the torch_npu probe fails to detect NPU devices.

Signed-off-by: Shuhao Zhang <shuhao_zhang@hust.edu.cn>
build_env_dict() previously preserved all existing LD_LIBRARY_PATH entries (in the active-vendor branch) or only removed /Ascend/ paths (in the clean branch). This allowed conda environment library paths like /root/miniconda3/envs/vllm-hust-dev/lib to leak into LD_LIBRARY_PATH, where conda-built system libraries (libstdc++.so.6, libgcc_s.so.1, libz.so.1) shadow the system/CANN driver versions. As a result, npu-smi, torch_npu device detection, and other Ascend runtime components malfunction because they load incompatible library versions from the conda environment instead of the system.

Fix: add _strip_conda_ld_paths(), which filters out paths containing conda/mamba/anaconda/envs markers, and apply it in both branches of build_env_dict(). Conda env libraries are already found via rpath and the conda activation mechanism, so they should never be in LD_LIBRARY_PATH.

Signed-off-by: Shuhao Zhang <shuhao_zhang@hust.edu.cn>
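The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name strip_conda_ld_paths, the marker tuple, the case-insensitive substring match, and the choice to drop empty entries are all assumptions made for the example, and the sample paths are illustrative.

```python
# Hypothetical sketch of a conda-path filter for LD_LIBRARY_PATH.
# The marker list mirrors the markers named in the commit message;
# the matching rule (case-insensitive substring) is an assumption.
CONDA_MARKERS = ("conda", "mamba", "anaconda", "envs")


def strip_conda_ld_paths(ld_library_path: str) -> str:
    """Return ld_library_path without entries that look like conda paths."""
    kept = []
    for entry in ld_library_path.split(":"):
        if not entry:
            # Assumption: drop empty segments (from "a::b" or a trailing ":").
            continue
        lowered = entry.lower()
        if any(marker in lowered for marker in CONDA_MARKERS):
            # Entry looks like a conda/mamba environment path; leave it out.
            continue
        kept.append(entry)
    return ":".join(kept)


example = (
    "/usr/local/Ascend/driver/lib64"
    ":/root/miniconda3/envs/vllm-hust-dev/lib"
    ":/usr/lib64"
)
cleaned = strip_conda_ld_paths(example)
```

A plain substring match is deliberately broad: any directory whose path contains "envs" is removed, whether or not it belongs to conda. Whether that trade-off is acceptable depends on the deployment.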
Root Cause
build_env_dict() preserved conda environment library paths in LD_LIBRARY_PATH:

- The active-vendor branch preserved all existing LD_LIBRARY_PATH entries, including conda paths like /root/miniconda3/envs/vllm-hust-dev/lib
- In the clean branch, _sanitize_ld_path() only removed /Ascend/ paths, preserving conda paths

Conda env lib/ directories contain system-like libraries (libstdc++.so.6, libgcc_s.so.1, libz.so.1) that shadow the system/CANN driver versions when placed in LD_LIBRARY_PATH. This causes:

- npu-smi to crash (conda libstdc++ incompatible with CANN driver)
- torch.npu.device_count() to silently return 0 (driver init fails with wrong libstdc++)

Fix

Commit 1: Add _strip_conda_ld_paths()

New function that filters out paths containing conda/mamba/anaconda/envs markers from LD_LIBRARY_PATH. Applied in both branches of build_env_dict(). Conda env libraries are already found via rpath and conda activation, so they should never be in LD_LIBRARY_PATH.

Commit 2: Include torch-npu version in report

Add torch-npu to the packages dict in _runtime_report() for better diagnostics when the torch_npu probe fails.
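For the version report, one way to collect installed package versions is shown below. This is only a sketch: the PR does not show how _runtime_report() gathers versions, so the use of importlib.metadata and the helper name installed_versions are assumptions for illustration.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


# Hypothetical packages dict covering both torch and torch-npu, so a
# version mismatch between the two is visible in the report.
packages = installed_versions(["torch", "torch-npu"])
```

Recording None for a missing package, rather than omitting the key, keeps the report shape stable and makes "torch-npu not installed" distinguishable from "torch-npu not checked".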