fix(install): B200 CUDA support, validator UX, per-install SLURM TRES… by smith6jt-cop · Pull Request #62 · smith6jt-cop/KINTSUGI

smith6jt-cop · 2026-04-16T20:17:00Z

… patch

Bump PyTorch/CUDA target from cu124 to cu128 across all torch-using groups (gpu, dl, kronos, denoise, full) plus env-hpc.yml and HPC_TROUBLESHOOTING.md. cu128 supports Blackwell B200 (sm_100) and resolves the 'NVIDIA B200 ... is not compatible with the current PyTorch installation' runtime error.
Post-install validator now distinguishes 'transient libstdc++ load failure after activation-script deploy' from real errors:
- New DependencyStatus.PENDING + _is_libstdcxx_load_error() helper.
- DependencyChecker runs _detect_libstdcxx_pending() at the top of check_all() and flips self._libstdcxx_pending when conda lib isn't yet on LD_LIBRARY_PATH but a newer conda libstdc++ exists.
- Package/libvips checks that fail with CXXABI_/GLIBCXX_/libstdc++ messages are reported as PENDING instead of MISSING/ERROR.
- _check_libstdcxx() itself reports PENDING (not ERROR) in the transient post-deploy sub-case.
- _post_install_validate() no longer treats PENDING as a failure; prints a single 'reactivate env' banner instead of the wall of six CXXABI/GLIBCXX errors that used to scare users on fresh installs.
Single-group kintsugi install <group> now calls _auto_patch_slurm_tres() after the subprocess.run, so SLURM users who only install 'gpu' or 'workflow' don't silently skip the TRES patch. The patcher is already idempotent, so it is a no-op on groups that don't bring in snakemake-executor-plugin-slurm-jobstep.

https://claude.ai/code/session_012CMWp8e1t8HDQt8NLPp6xM


Collapses multi-line console.print / print calls and list comprehensions that ruff format wants on one line. No behavior change. https://claude.ai/code/session_012CMWp8e1t8HDQt8NLPp6xM

Copilot

Pull request overview

Updates KINTSUGI’s install/validation flow to support NVIDIA Blackwell B200 (via CUDA 12.8 / cu128), reduce false-negative dependency validation noise right after HPC activation-script deployment, and ensure the SLURM TRES patch is applied even for single-group installs.

Changes:

Bump PyTorch/CUDA targets from cu124/12.4 to cu128/12.8 across install groups and HPC env/docs.
Add DependencyStatus.PENDING + libstdc++-mismatch detection to downgrade transient post-deploy import failures and avoid failing installs on self-healing conditions.
Run _auto_patch_slurm_tres() for single-group installs, not just install all.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| tests/test_deps.py | Updates CUDA version expectations in pre-install guard test. |
| src/kintsugi/deps.py | Adds PENDING status, libstdc++ transient detection/downgrade logic, and updates install recipes to cu128/12.8. |
| src/kintsugi/cli.py | Calls SLURM TRES auto-patch on single installs; adjusts post-install validator behavior/messaging for PENDING. |
| envs/env-hpc.yml | Bumps `pytorch-cuda` pin to 12.8 for HPC environments. |
| docs/HPC_TROUBLESHOOTING.md | Updates troubleshooting commands/notes to cu128/12.8. |
| CLAUDE.md | Updates documented torch CUDA index URL to cu128. |


Copilot · 2026-04-16T20:21:41Z

 # Re-install from PyTorch CUDA channel
-pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
+pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

 # Or via conda (preferred on HPC):
-conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia -c conda-forge
+conda install pytorch torchvision pytorch-cuda=12.8 -c pytorch -c nvidia -c conda-forge
 ```

-**Prevention:** Use `envs/env-hpc.yml` which installs PyTorch from the `pytorch` conda channel with `pytorch-cuda=12.4`. Never run bare `pip install torch` on HPC — always use the CUDA index URL.
+**Prevention:** Use `envs/env-hpc.yml` which installs PyTorch from the `pytorch` conda channel with `pytorch-cuda=12.8`. Never run bare `pip install torch` on HPC — always use the CUDA index URL.



This section updates the recommended CUDA/PyTorch versions to 12.8/cu128, but the same document still references CUDA 12.4 later (“If this prints CUDA build: 12.4…”). Please update that later example too so the troubleshooting guide is internally consistent.
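A quick way to audit which CUDA build an environment reports is to compare the version string against the 12.8 floor this PR targets. The sketch below is illustrative only: `meets_cuda_floor` is a hypothetical helper, not part of KINTSUGI, and it assumes the string has the `major.minor` form that `torch.version.cuda` returns (or `None` for a CPU-only build).

```python
from typing import Optional, Tuple


def meets_cuda_floor(cuda_build: Optional[str], floor: Tuple[int, int] = (12, 8)) -> bool:
    """Return True if a 'major.minor' CUDA build string is at or above `floor`.

    `cuda_build` is expected to look like torch.version.cuda, e.g. "12.8".
    None (a CPU-only PyTorch build) never meets the floor.
    """
    if cuda_build is None:
        return False
    # Compare as integer tuples so that "12.10" sorts above "12.8".
    major, minor = (int(part) for part in cuda_build.split(".")[:2])
    return (major, minor) >= floor
```

With PyTorch installed, `meets_cuda_floor(torch.version.cuda)` gives the live answer for the active environment; a cu124 build would report `False`.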

Copilot · 2026-04-16T20:21:42Z

        except ImportError as e:
+            err_msg = str(e)
+            if self._libstdcxx_pending and _is_libstdcxx_load_error(err_msg):
+                return DependencyResult(
+                    name=package,
+                    status=DependencyStatus.PENDING,
+                    required_version=min_version,
+                    message="Will verify after env reactivation (stale LD_LIBRARY_PATH)",
+                    is_optional=optional,
+                )
            status = DependencyStatus.OPTIONAL_MISSING if optional else DependencyStatus.MISSING
            return DependencyResult(
                name=package,
                status=status,
                required_version=min_version,
-                message=str(e),
+                message=err_msg,
                is_optional=optional,
            )
        except Exception as e:
+            err_msg = str(e)
+            if self._libstdcxx_pending and _is_libstdcxx_load_error(err_msg):
+                return DependencyResult(
+                    name=package,
+                    status=DependencyStatus.PENDING,
+                    message="Will verify after env reactivation (stale LD_LIBRARY_PATH)",
+                    is_optional=optional,
+                )
            return DependencyResult(


New DependencyStatus.PENDING behavior (downgrading libstdc++-signature ImportErrors / summary counting PENDING as non-fatal) isn’t covered by tests. Please add unit tests that (1) force _libstdcxx_pending=True, (2) raise an ImportError containing a GLIBCXX_/CXXABI_ token from _check_python_package() / _check_libvips(), and (3) assert the resulting status is PENDING and that _generate_summary() treats it as non-missing.
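One possible shape for such tests is sketched below against stand-ins, since the real `DependencyChecker` internals are not shown in this diff. `Status`, `is_libstdcxx_load_error`, and `classify_import_failure` are hypothetical simplifications of `DependencyStatus`, `_is_libstdcxx_load_error()`, and the `except ImportError` branch above; the token list is an assumption taken from the PR description.

```python
import enum


class Status(enum.Enum):
    """Stand-in for DependencyStatus (only the members this sketch needs)."""
    MISSING = "missing"
    OPTIONAL_MISSING = "optional_missing"
    PENDING = "pending"


# Assumed signature tokens, from the PR description (CXXABI_/GLIBCXX_/libstdc++).
LIBSTDCXX_TOKENS = ("CXXABI_", "GLIBCXX_", "libstdc++")


def is_libstdcxx_load_error(message: str) -> bool:
    """Hypothetical equivalent of _is_libstdcxx_load_error()."""
    return any(token in message for token in LIBSTDCXX_TOKENS)


def classify_import_failure(exc: ImportError, pending: bool, optional: bool = False) -> Status:
    """Mirror the downgrade branch: PENDING only when the pending flag is set
    AND the message carries a libstdc++ signature."""
    if pending and is_libstdcxx_load_error(str(exc)):
        return Status.PENDING
    return Status.OPTIONAL_MISSING if optional else Status.MISSING


def test_libstdcxx_error_is_pending_when_flag_set():
    exc = ImportError("libstdc++.so.6: version `GLIBCXX_3.4.30' not found")
    assert classify_import_failure(exc, pending=True) is Status.PENDING


def test_libstdcxx_error_stays_missing_without_flag():
    exc = ImportError("version `CXXABI_1.3.13' not found")
    assert classify_import_failure(exc, pending=False) is Status.MISSING


def test_unrelated_error_is_never_downgraded():
    exc = ImportError("No module named 'pyvips'")
    assert classify_import_failure(exc, pending=True, optional=True) is Status.OPTIONAL_MISSING
```

In the real suite the same three cases would presumably be driven through `_check_python_package()` with the import patched to raise, plus a fourth assertion on `_generate_summary()`.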

Copilot · 2026-04-16T20:21:42Z

+    # Single clear banner when activation scripts were just deployed — avoids
+    # the cascade of red CXXABI/GLIBCXX errors that used to scare the user on
+    # a fresh install.
+    if checker._libstdcxx_pending:
+        console.print("\n[yellow]⚠  Environment activation scripts were just deployed.[/yellow]")
+        console.print("[yellow]   Reactivate the env to verify all libraries load:[/yellow]")
+        console.print("[yellow]     conda deactivate && conda activate KINTSUGI[/yellow]")
+        console.print("[yellow]     kintsugi check[/yellow]")
+
    return True




_post_install_validate() prints the “activation scripts were just deployed” banner based solely on checker._libstdcxx_pending. With the current _detect_libstdcxx_pending() logic, this flag will be True in most Linux conda envs where LD_LIBRARY_PATH doesn’t include $CONDA_PREFIX/lib, even when no checks are actually pending. Recommend driving the banner off actual PENDING results (e.g., any(r.status == DependencyStatus.PENDING for r in checker.results)) and/or the summary returned by check_all() instead of the private _libstdcxx_pending field.
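That recommendation can be sketched as follows, assuming `checker.results` is an iterable of objects exposing a `status` attribute (the comment implies this, but the layout is not shown in the diff). `Status`, `Result`, and `should_show_reactivate_banner` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Status(Enum):
    """Stand-in for DependencyStatus."""
    OK = "ok"
    MISSING = "missing"
    PENDING = "pending"


@dataclass
class Result:
    """Stand-in for DependencyResult (only the fields this sketch needs)."""
    name: str
    status: Status


def should_show_reactivate_banner(results: Iterable[Result]) -> bool:
    """Show the banner only when at least one check was actually downgraded
    to PENDING, rather than whenever the environment merely looks suspicious."""
    return any(r.status is Status.PENDING for r in results)
```

Keying the banner to results rather than to the private flag means an environment where every import succeeds stays silent, even if `LD_LIBRARY_PATH` lacks the conda `lib` directory.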

Copilot · 2026-04-16T20:21:42Z

+            return  # conda lib already on LD_LIBRARY_PATH — no cascade risk
+        conda_libstdcxx = Path(conda_prefix) / "lib" / "libstdc++.so.6"
+        if not conda_libstdcxx.exists():
+            return  # no newer libstdc++ to shadow the system one


_detect_libstdcxx_pending() currently sets _libstdcxx_pending=True whenever a conda env is active, $CONDA_PREFIX/lib is not on LD_LIBRARY_PATH, and $CONDA_PREFIX/lib/libstdc++.so.6 exists. That combination is common even on non-HPC systems and doesn’t necessarily mean activation scripts were just deployed or that a transient mismatch exists. Consider tightening the predicate (e.g., require the deployed activation script file to exist under $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh, and/or confirm a version-mismatch condition) so PENDING downgrades only happen in the intended transient post-deploy case.

Suggested change

-            return  # no newer libstdc++ to shadow the system one
+            return  # no newer libstdc++ to shadow the system one
+        activate_script = (
+            Path(conda_prefix) / "etc" / "conda" / "activate.d" / "env_vars.sh"
+        )
+        if not activate_script.is_file():
+            return  # no deployed activation hook; avoid broad false-positive PENDING
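Putting the existing early returns and the suggested guard together, the tightened predicate might look like the sketch below. It is written as a pure function over an environment mapping so it can be exercised without a real conda env; `libstdcxx_pending` is a hypothetical name, and the path-membership test on `LD_LIBRARY_PATH` is an assumption about how the first early return is computed.

```python
import os
from pathlib import Path
from typing import Mapping


def libstdcxx_pending(env: Mapping[str, str]) -> bool:
    """Return True only in the narrow post-deploy case described in the review."""
    conda_prefix = env.get("CONDA_PREFIX")
    if not conda_prefix:
        return False  # no active conda env

    conda_lib = Path(conda_prefix) / "lib"
    ld_entries = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if any(Path(p) == conda_lib for p in ld_entries):
        return False  # conda lib already on LD_LIBRARY_PATH: no cascade risk

    if not (conda_lib / "libstdc++.so.6").exists():
        return False  # no conda libstdc++ to shadow the system one

    # Suggested extra guard: require the deployed activation hook to exist.
    activate_script = Path(conda_prefix) / "etc" / "conda" / "activate.d" / "env_vars.sh"
    return activate_script.is_file()
```

In practice it would be called as `libstdcxx_pending(os.environ)`. Taking the mapping as a parameter keeps the predicate easy to unit-test with a temporary directory.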

Copilot AI review requested due to automatic review settings April 16, 2026 20:17

smith6jt-cop self-assigned this Apr 16, 2026

Copilot started reviewing on behalf of smith6jt-cop April 16, 2026 20:17 View session

style: apply ruff format to match CI line-length policy

a336fdc


Copilot AI reviewed Apr 16, 2026


smith6jt-cop merged commit bbecd84 into main Apr 16, 2026
10 checks passed

smith6jt-cop deleted the claude/fix-dependency-versions-AglwB branch April 16, 2026 20:55

smith6jt-cop merged 2 commits into main from claude/fix-dependency-versions-AglwB

3 participants

