fix(install): B200 CUDA support, validator UX, per-install SLURM TRES…#62
… patch
- Bump PyTorch/CUDA target from cu124 to cu128 across all torch-using
  groups (gpu, dl, kronos, denoise, full) plus env-hpc.yml and
  HPC_TROUBLESHOOTING.md. cu128 supports Blackwell B200 (sm_100) and
  resolves the 'NVIDIA B200 ... is not compatible with the current
  PyTorch installation' runtime error.
- Post-install validator now distinguishes 'transient libstdc++ load
  failure after activation-script deploy' from real errors:
  * New DependencyStatus.PENDING + _is_libstdcxx_load_error() helper.
  * DependencyChecker runs _detect_libstdcxx_pending() at the top of
    check_all() and sets self._libstdcxx_pending when the conda lib
    directory isn't yet on LD_LIBRARY_PATH but a newer conda libstdc++
    exists.
  * Package/libvips checks that fail with CXXABI_/GLIBCXX_/libstdc++
    messages are reported as PENDING instead of MISSING/ERROR.
  * _check_libstdcxx() itself reports PENDING (not ERROR) in the
    transient post-deploy sub-case.
  * _post_install_validate() no longer treats PENDING as a failure;
    it prints a single 'reactivate env' banner instead of the wall of
    six CXXABI/GLIBCXX errors that used to scare users on fresh installs.
- Single-group `kintsugi install <group>` now calls
  _auto_patch_slurm_tres() after the subprocess.run, so SLURM users who
  only install 'gpu' or 'workflow' don't silently skip the TRES patch.
  The patcher is already idempotent, so it no-ops on groups that don't
  bring in snakemake-executor-plugin-slurm-jobstep.
https://claude.ai/code/session_012CMWp8e1t8HDQt8NLPp6xM
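The message-based classification described in this commit message could look roughly like the sketch below. The body of `_is_libstdcxx_load_error()` is not shown in this excerpt, so the public-style function name, the marker tuple, and the sample error strings are all assumptions made for illustration, not the PR's actual code.

```python
# Hypothetical stand-in for the PR's _is_libstdcxx_load_error() helper.
# The marker tuple is an assumption derived from the commit message
# ("CXXABI_/GLIBCXX_/libstdc++ messages"), not copied from the source.
_LIBSTDCXX_MARKERS = ("CXXABI_", "GLIBCXX_", "libstdc++")


def is_libstdcxx_load_error(message: str) -> bool:
    """Return True if an error message looks like a libstdc++ ABI mismatch."""
    return any(marker in message for marker in _LIBSTDCXX_MARKERS)


# Illustrative messages: one shaped like a dynamic-loader ABI complaint,
# one ordinary "package not installed" error.
abi_error = (
    "/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found "
    "(required by some_extension.so)"
)
plain_error = "No module named 'torch'"

print(is_libstdcxx_load_error(abi_error))    # True
print(is_libstdcxx_load_error(plain_error))  # False
```

Plain substring matching keeps the helper cheap, but it also means any message that merely mentions `libstdc++` is classified as a load error, which is why the PR gates the downgrade on a separate pending flag as well.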
Collapses multi-line console.print / print calls and list comprehensions that ruff format wants on one line. No behavior change. https://claude.ai/code/session_012CMWp8e1t8HDQt8NLPp6xM
Pull request overview
Updates KINTSUGI’s install/validation flow to support NVIDIA Blackwell B200 (via CUDA 12.8 / cu128), reduce false-negative dependency validation noise right after HPC activation-script deployment, and ensure the SLURM TRES patch is applied even for single-group installs.
Changes:
- Bump PyTorch/CUDA targets from cu124/12.4 to cu128/12.8 across install groups and HPC env/docs.
- Add `DependencyStatus.PENDING` + libstdc++-mismatch detection to downgrade transient post-deploy import failures and avoid failing installs on self-healing conditions.
- Run `_auto_patch_slurm_tres()` for single-group installs, not just `install all`.
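To illustrate why the cu124 to cu128 bump matters for B200, the sketch below compares a device's compute capability against the architectures a PyTorch build was compiled for. `build_supports_device` is a hypothetical helper, not part of KINTSUGI or PyTorch; the matching rule is a simplification (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version; a `compute_XY` PTX entry can be JIT-compiled for equal or newer devices), and the two arch lists are illustrative rather than the exact contents of any wheel.

```python
import re

# Matches entries such as "sm_90", "sm_100" or "compute_90".
_ARCH_RE = re.compile(r"^(sm|compute)_(\d+)")


def build_supports_device(arch_list, capability):
    """Return True if any compiled architecture can run on the device.

    arch_list:  strings shaped like the output of torch.cuda.get_arch_list()
    capability: (major, minor) tuple shaped like the output of
                torch.cuda.get_device_capability()
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        match = _ARCH_RE.match(arch)
        if match is None:
            continue
        kind, number = match.group(1), int(match.group(2))
        major, minor = divmod(number, 10)
        # Native binaries: same major version, minor not above the device's.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX: forward-compatible with equal or newer devices via JIT.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


# Illustrative arch lists (assumptions, not taken from real wheels).
older_build = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
newer_build = older_build + ["sm_100", "sm_120"]
b200 = (10, 0)  # sm_100, as stated in the PR description

print(build_supports_device(older_build, b200))  # False
print(build_supports_device(newer_build, b200))  # True
```

On a machine with PyTorch and a GPU, the same function could be fed `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability(0)` instead of the hard-coded values.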
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_deps.py | Updates CUDA version expectations in pre-install guard test. |
| src/kintsugi/deps.py | Adds PENDING status, libstdc++ transient detection/downgrade logic, and updates install recipes to cu128/12.8. |
| src/kintsugi/cli.py | Calls SLURM TRES auto-patch on single installs; adjusts post-install validator behavior/messaging for PENDING. |
| envs/env-hpc.yml | Bumps pytorch-cuda pin to 12.8 for HPC environments. |
| docs/HPC_TROUBLESHOOTING.md | Updates troubleshooting commands/notes to cu128/12.8. |
| CLAUDE.md | Updates documented torch CUDA index URL to cu128. |
|   # Re-install from PyTorch CUDA channel | ||
| - pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124 | ||
| + pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128 | ||
|   # Or via conda (preferred on HPC): | ||
| - conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia -c conda-forge | ||
| + conda install pytorch torchvision pytorch-cuda=12.8 -c pytorch -c nvidia -c conda-forge | ||
|   ``` | ||
| - **Prevention:** Use `envs/env-hpc.yml` which installs PyTorch from the `pytorch` conda channel with `pytorch-cuda=12.4`. Never run bare `pip install torch` on HPC — always use the CUDA index URL. | ||
| + **Prevention:** Use `envs/env-hpc.yml` which installs PyTorch from the `pytorch` conda channel with `pytorch-cuda=12.8`. Never run bare `pip install torch` on HPC — always use the CUDA index URL. | ||
This section updates the recommended CUDA/PyTorch versions to 12.8/cu128, but the same document still references CUDA 12.4 later (“If this prints CUDA build: 12.4…”). Please update that later example too so the troubleshooting guide is internally consistent.
|   except ImportError as e: | ||
| +     err_msg = str(e) | ||
| +     if self._libstdcxx_pending and _is_libstdcxx_load_error(err_msg): | ||
| +         return DependencyResult( | ||
| +             name=package, | ||
| +             status=DependencyStatus.PENDING, | ||
| +             required_version=min_version, | ||
| +             message="Will verify after env reactivation (stale LD_LIBRARY_PATH)", | ||
| +             is_optional=optional, | ||
| +         ) | ||
|       status = DependencyStatus.OPTIONAL_MISSING if optional else DependencyStatus.MISSING | ||
|       return DependencyResult( | ||
|           name=package, | ||
|           status=status, | ||
|           required_version=min_version, | ||
| -         message=str(e), | ||
| +         message=err_msg, | ||
|           is_optional=optional, | ||
|       ) | ||
|   except Exception as e: | ||
| +     err_msg = str(e) | ||
| +     if self._libstdcxx_pending and _is_libstdcxx_load_error(err_msg): | ||
| +         return DependencyResult( | ||
| +             name=package, | ||
| +             status=DependencyStatus.PENDING, | ||
| +             message="Will verify after env reactivation (stale LD_LIBRARY_PATH)", | ||
| +             is_optional=optional, | ||
| +         ) | ||
|       return DependencyResult( |
New DependencyStatus.PENDING behavior (downgrading libstdc++-signature ImportErrors / summary counting PENDING as non-fatal) isn’t covered by tests. Please add unit tests that (1) force _libstdcxx_pending=True, (2) raise an ImportError containing a GLIBCXX_/CXXABI_ token from _check_python_package() / _check_libvips(), and (3) assert the resulting status is PENDING and that _generate_summary() treats it as non-missing.
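The three requested assertions could be structured along the lines of the sketch below. Because the real `DependencyChecker`, `_check_python_package()` and `_generate_summary()` signatures are not visible in this excerpt, the test runs against small stand-ins (`classify_import_failure` and `count_missing` are hypothetical names) that mirror only the branch shown in the diff; real tests would patch the import call and go through the actual checker instead.

```python
import unittest
from enum import Enum


class DependencyStatus(Enum):  # stand-in; the real enum has more members
    MISSING = "missing"
    OPTIONAL_MISSING = "optional_missing"
    PENDING = "pending"


def classify_import_failure(err_msg, libstdcxx_pending, optional=False):
    """Stand-in mirroring the except-branch from the diff hunk."""
    is_abi_error = any(t in err_msg for t in ("CXXABI_", "GLIBCXX_", "libstdc++"))
    if libstdcxx_pending and is_abi_error:
        return DependencyStatus.PENDING
    return DependencyStatus.OPTIONAL_MISSING if optional else DependencyStatus.MISSING


def count_missing(statuses):
    """Stand-in for the summary: PENDING must not be counted as missing."""
    return sum(status is DependencyStatus.MISSING for status in statuses)


class PendingDowngradeTests(unittest.TestCase):
    ABI_MSG = "libstdc++.so.6: version `GLIBCXX_3.4.30' not found"

    def test_abi_error_is_pending_when_flag_set(self):
        status = classify_import_failure(self.ABI_MSG, libstdcxx_pending=True)
        self.assertIs(status, DependencyStatus.PENDING)

    def test_abi_error_is_missing_when_flag_clear(self):
        status = classify_import_failure(self.ABI_MSG, libstdcxx_pending=False)
        self.assertIs(status, DependencyStatus.MISSING)

    def test_unrelated_error_is_never_downgraded(self):
        status = classify_import_failure("No module named 'foo'", libstdcxx_pending=True)
        self.assertIs(status, DependencyStatus.MISSING)

    def test_summary_ignores_pending(self):
        statuses = [DependencyStatus.PENDING, DependencyStatus.MISSING]
        self.assertEqual(count_missing(statuses), 1)


# Run the suite programmatically so the snippet works when pasted into a REPL.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PendingDowngradeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The negative cases (flag clear, unrelated message) matter as much as the positive one, since they guard against the downgrade masking genuinely missing packages.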
| + # Single clear banner when activation scripts were just deployed — avoids | ||
| + # the cascade of red CXXABI/GLIBCXX errors that used to scare the user on | ||
| + # a fresh install. | ||
| + if checker._libstdcxx_pending: | ||
| +     console.print("\n[yellow]⚠ Environment activation scripts were just deployed.[/yellow]") | ||
| +     console.print("[yellow] Reactivate the env to verify all libraries load:[/yellow]") | ||
| +     console.print("[yellow] conda deactivate && conda activate KINTSUGI[/yellow]") | ||
| +     console.print("[yellow] kintsugi check[/yellow]") | ||
|   return True | ||
_post_install_validate() prints the “activation scripts were just deployed” banner based solely on checker._libstdcxx_pending. With the current _detect_libstdcxx_pending() logic, this flag will be True in most Linux conda envs where LD_LIBRARY_PATH doesn’t include $CONDA_PREFIX/lib, even when no checks are actually pending. Recommend driving the banner off actual PENDING results (e.g., any(r.status == DependencyStatus.PENDING for r in checker.results)) and/or the summary returned by check_all() instead of the private _libstdcxx_pending field.
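A minimal sketch of the results-driven check suggested in this comment follows. `DependencyStatus` and `DependencyResult` here are trimmed stand-ins with only the members needed for the example, `should_show_reactivate_banner` is a hypothetical helper name, and `checker.results` is assumed to be an iterable of such result objects, as the comment implies.

```python
from dataclasses import dataclass
from enum import Enum


class DependencyStatus(Enum):  # stand-in; the real enum has more members
    OK = "ok"
    MISSING = "missing"
    PENDING = "pending"


@dataclass
class DependencyResult:  # stand-in carrying only the fields used below
    name: str
    status: DependencyStatus


def should_show_reactivate_banner(results):
    """Show the banner only when at least one check was actually deferred."""
    return any(r.status == DependencyStatus.PENDING for r in results)


# Environment where everything imported cleanly: no banner.
clean = [DependencyResult("numpy", DependencyStatus.OK)]
# Environment where one check was downgraded to PENDING: banner.
deferred = clean + [DependencyResult("pyvips", DependencyStatus.PENDING)]

print(should_show_reactivate_banner(clean))     # False
print(should_show_reactivate_banner(deferred))  # True
```

Keying the banner off results rather than the private flag means a broad detection predicate can no longer trigger the message on its own.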
| +     return  # conda lib already on LD_LIBRARY_PATH — no cascade risk | ||
| + conda_libstdcxx = Path(conda_prefix) / "lib" / "libstdc++.so.6" | ||
| + if not conda_libstdcxx.exists(): | ||
| +     return  # no newer libstdc++ to shadow the system one |
_detect_libstdcxx_pending() currently sets _libstdcxx_pending=True whenever a conda env is active, $CONDA_PREFIX/lib is not on LD_LIBRARY_PATH, and $CONDA_PREFIX/lib/libstdc++.so.6 exists. That combination is common even on non-HPC systems and doesn’t necessarily mean activation scripts were just deployed or that a transient mismatch exists. Consider tightening the predicate (e.g., require the deployed activation script file to exist under $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh, and/or confirm a version-mismatch condition) so PENDING downgrades only happen in the intended transient post-deploy case.
| -     return  # no newer libstdc++ to shadow the system one | |
| +     return  # no newer libstdc++ to shadow the system one | |
| + activate_script = ( | |
| +     Path(conda_prefix) / "etc" / "conda" / "activate.d" / "env_vars.sh" | |
| + ) | |
| + if not activate_script.is_file(): | |
| +     return  # no deployed activation hook; avoid broad false-positive PENDING | |
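Put together, the tightened predicate from this suggestion might behave like the standalone sketch below. It is written as a pure function over an environment mapping so it can be exercised without touching the real process environment; `libstdcxx_pending` is a hypothetical name, and the real `_detect_libstdcxx_pending()` sets an attribute on the checker rather than returning a value.

```python
import os
import tempfile
from pathlib import Path


def libstdcxx_pending(env):
    """Return True only in the transient post-deploy case (sketch)."""
    conda_prefix = env.get("CONDA_PREFIX")
    if not conda_prefix:
        return False  # no active conda env
    conda_lib = Path(conda_prefix) / "lib"
    ld_paths = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if any(Path(p) == conda_lib for p in ld_paths):
        return False  # conda lib already on LD_LIBRARY_PATH
    if not (conda_lib / "libstdc++.so.6").exists():
        return False  # no conda libstdc++ that could shadow the system one
    activate_script = Path(conda_prefix) / "etc" / "conda" / "activate.d" / "env_vars.sh"
    return activate_script.is_file()  # require the deployed activation hook


# Walk a throwaway fake prefix through the three states of interest.
with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp)
    (prefix / "lib").mkdir()
    (prefix / "lib" / "libstdc++.so.6").touch()
    env = {"CONDA_PREFIX": str(prefix), "LD_LIBRARY_PATH": "/usr/lib64"}
    before_deploy = libstdcxx_pending(env)  # hook not deployed yet

    hook_dir = prefix / "etc" / "conda" / "activate.d"
    hook_dir.mkdir(parents=True)
    (hook_dir / "env_vars.sh").touch()
    after_deploy = libstdcxx_pending(env)  # hook deployed, env not reactivated

    env["LD_LIBRARY_PATH"] = f"{prefix / 'lib'}{os.pathsep}/usr/lib64"
    after_reactivate = libstdcxx_pending(env)  # conda lib now on the path

print(before_deploy, after_deploy, after_reactivate)  # False True False
```

Only the middle state, where the hook exists but the current shell has not picked it up yet, yields a pending result, which is the case the PR intends to downgrade.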