fix(install): B200 CUDA support, validator UX, per-install SLURM TRES…#62

Merged: smith6jt-cop merged 2 commits into main from claude/fix-dependency-versions-AglwB on Apr 16, 2026
@smith6jt-cop

… patch

  • Bump PyTorch/CUDA target from cu124 to cu128 across all torch-using groups (gpu, dl, kronos, denoise, full) plus env-hpc.yml and HPC_TROUBLESHOOTING.md. cu128 supports Blackwell B200 (sm_100) and resolves the 'NVIDIA B200 ... is not compatible with the current PyTorch installation' runtime error.

  • Post-install validator now distinguishes 'transient libstdc++ load failure after activation-script deploy' from real errors:

    • New DependencyStatus.PENDING + _is_libstdcxx_load_error() helper.
    • DependencyChecker runs _detect_libstdcxx_pending() at the top of check_all() and flips self._libstdcxx_pending when conda lib isn't yet on LD_LIBRARY_PATH but a newer conda libstdc++ exists.
    • Package/libvips checks that fail with CXXABI_/GLIBCXX_/libstdc++ messages are reported as PENDING instead of MISSING/ERROR.
    • _check_libstdcxx() itself reports PENDING (not ERROR) in the transient post-deploy sub-case.
    • _post_install_validate() no longer treats PENDING as a failure; prints a single 'reactivate env' banner instead of the wall of six CXXABI/GLIBCXX errors that used to scare users on fresh installs.
  • Single-group kintsugi install <group> now calls _auto_patch_slurm_tres() after the subprocess.run, so SLURM users who only install 'gpu' or 'workflow' don't silently skip the TRES patch. The patcher is already idempotent, so it is a no-op on groups that don't bring in snakemake-executor-plugin-slurm-jobstep.

https://claude.ai/code/session_012CMWp8e1t8HDQt8NLPp6xM
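
The downgrade described above can be illustrated with a minimal sketch. The three tokens (CXXABI_, GLIBCXX_, libstdc++) come from the description; the names `looks_like_libstdcxx_load_error` and `classify_import_failure`, and the enum values, are hypothetical stand-ins and not the code in src/kintsugi/deps.py.

```python
from enum import Enum


class DependencyStatus(Enum):
    """Stand-in enum; only the members named in this PR are modelled."""

    MISSING = "missing"
    ERROR = "error"
    PENDING = "pending"


# Substrings the PR description associates with a libstdc++ load failure.
_LIBSTDCXX_TOKENS = ("CXXABI_", "GLIBCXX_", "libstdc++")


def looks_like_libstdcxx_load_error(message: str) -> bool:
    """Return True if the error text mentions a libstdc++ ABI symbol or library."""
    return any(token in message for token in _LIBSTDCXX_TOKENS)


def classify_import_failure(message: str, libstdcxx_pending: bool) -> DependencyStatus:
    """Downgrade to PENDING only when the pending flag is set AND the text matches."""
    if libstdcxx_pending and looks_like_libstdcxx_load_error(message):
        return DependencyStatus.PENDING
    return DependencyStatus.MISSING
```

Requiring both conditions means an ordinary "No module named ..." failure is still reported as MISSING even when the pending flag is set.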

Copilot AI review requested due to automatic review settings April 16, 2026 20:17
@smith6jt-cop smith6jt-cop self-assigned this Apr 16, 2026
Collapses multi-line console.print / print calls and list comprehensions
that ruff format wants on one line. No behavior change.

https://claude.ai/code/session_012CMWp8e1t8HDQt8NLPp6xM

Copilot AI left a comment

Pull request overview

Updates KINTSUGI’s install/validation flow to support NVIDIA Blackwell B200 (via CUDA 12.8 / cu128), reduce false-negative dependency validation noise right after HPC activation-script deployment, and ensure the SLURM TRES patch is applied even for single-group installs.

Changes:

  • Bump PyTorch/CUDA targets from cu124/12.4 to cu128/12.8 across install groups and HPC env/docs.
  • Add DependencyStatus.PENDING + libstdc++-mismatch detection to downgrade transient post-deploy import failures and avoid failing installs on self-healing conditions.
  • Run _auto_patch_slurm_tres() for single-group installs, not just install all.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_deps.py | Updates CUDA version expectations in pre-install guard test. |
| src/kintsugi/deps.py | Adds PENDING status, libstdc++ transient detection/downgrade logic, and updates install recipes to cu128/12.8. |
| src/kintsugi/cli.py | Calls SLURM TRES auto-patch on single installs; adjusts post-install validator behavior/messaging for PENDING. |
| envs/env-hpc.yml | Bumps pytorch-cuda pin to 12.8 for HPC environments. |
| docs/HPC_TROUBLESHOOTING.md | Updates troubleshooting commands/notes to cu128/12.8. |
| CLAUDE.md | Updates documented torch CUDA index URL to cu128. |

Comment on lines 93 to 101
# Re-install from PyTorch CUDA channel
-pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
+pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Or via conda (preferred on HPC):
-conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia -c conda-forge
+conda install pytorch torchvision pytorch-cuda=12.8 -c pytorch -c nvidia -c conda-forge
```

-**Prevention:** Use `envs/env-hpc.yml` which installs PyTorch from the `pytorch` conda channel with `pytorch-cuda=12.4`. Never run bare `pip install torch` on HPC — always use the CUDA index URL.
+**Prevention:** Use `envs/env-hpc.yml` which installs PyTorch from the `pytorch` conda channel with `pytorch-cuda=12.8`. Never run bare `pip install torch` on HPC — always use the CUDA index URL.

Copilot AI Apr 16, 2026

This section updates the recommended CUDA/PyTorch versions to 12.8/cu128, but the same document still references CUDA 12.4 later (“If this prints CUDA build: 12.4…”). Please update that later example too so the troubleshooting guide is internally consistent.

Copilot uses AI. Check for mistakes.
Comment thread src/kintsugi/deps.py
Comment on lines 765 to 792
         except ImportError as e:
             err_msg = str(e)
             if self._libstdcxx_pending and _is_libstdcxx_load_error(err_msg):
                 return DependencyResult(
                     name=package,
                     status=DependencyStatus.PENDING,
                     required_version=min_version,
                     message="Will verify after env reactivation (stale LD_LIBRARY_PATH)",
                     is_optional=optional,
                 )
             status = DependencyStatus.OPTIONAL_MISSING if optional else DependencyStatus.MISSING
             return DependencyResult(
                 name=package,
                 status=status,
                 required_version=min_version,
-                message=str(e),
+                message=err_msg,
                 is_optional=optional,
             )
         except Exception as e:
             err_msg = str(e)
             if self._libstdcxx_pending and _is_libstdcxx_load_error(err_msg):
                 return DependencyResult(
                     name=package,
                     status=DependencyStatus.PENDING,
                     message="Will verify after env reactivation (stale LD_LIBRARY_PATH)",
                     is_optional=optional,
                 )
             return DependencyResult(

Copilot AI Apr 16, 2026

New DependencyStatus.PENDING behavior (downgrading libstdc++-signature ImportErrors / summary counting PENDING as non-fatal) isn’t covered by tests. Please add unit tests that (1) force _libstdcxx_pending=True, (2) raise an ImportError containing a GLIBCXX_/CXXABI_ token from _check_python_package() / _check_libvips(), and (3) assert the resulting status is PENDING and that _generate_summary() treats it as non-missing.
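
A rough sketch of what such tests might look like follows. Because the real `DependencyChecker`, `_check_python_package()` and `_generate_summary()` are not shown in this thread, the sketch uses hypothetical stand-ins (`FakeChecker`, `check_package`, `count_missing`) that mirror only the exception handling visible in the diff above; against the real code one would presumably set `_libstdcxx_pending` on a checker instance and patch the import call instead.

```python
from dataclasses import dataclass
from enum import Enum


class DependencyStatus(Enum):
    OK = "ok"
    MISSING = "missing"
    PENDING = "pending"


@dataclass
class DependencyResult:
    name: str
    status: DependencyStatus
    message: str = ""


class FakeChecker:
    """Hypothetical stand-in for DependencyChecker (not the real class)."""

    def __init__(self, libstdcxx_pending: bool):
        self._libstdcxx_pending = libstdcxx_pending

    def check_package(self, package, importer):
        # `importer` is injected so the test can force an ImportError.
        try:
            importer(package)
        except ImportError as e:
            err_msg = str(e)
            is_abi = any(t in err_msg for t in ("GLIBCXX_", "CXXABI_", "libstdc++"))
            if self._libstdcxx_pending and is_abi:
                return DependencyResult(package, DependencyStatus.PENDING, err_msg)
            return DependencyResult(package, DependencyStatus.MISSING, err_msg)
        return DependencyResult(package, DependencyStatus.OK)


def count_missing(results):
    """Stand-in for the summary: PENDING must not count as missing."""
    return sum(r.status is DependencyStatus.MISSING for r in results)


def failing_import(_name):
    raise ImportError("version `GLIBCXX_3.4.32' not found")


def test_abi_error_is_pending_when_flag_set():
    result = FakeChecker(True).check_package("pyvips", failing_import)
    assert result.status is DependencyStatus.PENDING
    assert count_missing([result]) == 0


def test_abi_error_is_missing_when_flag_unset():
    result = FakeChecker(False).check_package("pyvips", failing_import)
    assert result.status is DependencyStatus.MISSING
    assert count_missing([result]) == 1
```

The second test is the negative control: the same error text must still be reported as MISSING when the pending flag is off.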

Comment thread src/kintsugi/cli.py
Comment on lines +397 to 408
    # Single clear banner when activation scripts were just deployed — avoids
    # the cascade of red CXXABI/GLIBCXX errors that used to scare the user on
    # a fresh install.
    if checker._libstdcxx_pending:
        console.print("\n[yellow]⚠ Environment activation scripts were just deployed.[/yellow]")
        console.print("[yellow] Reactivate the env to verify all libraries load:[/yellow]")
        console.print("[yellow] conda deactivate && conda activate KINTSUGI[/yellow]")
        console.print("[yellow] kintsugi check[/yellow]")

    return True


Copilot AI Apr 16, 2026

_post_install_validate() prints the “activation scripts were just deployed” banner based solely on checker._libstdcxx_pending. With the current _detect_libstdcxx_pending() logic, this flag will be True in most Linux conda envs where LD_LIBRARY_PATH doesn’t include $CONDA_PREFIX/lib, even when no checks are actually pending. Recommend driving the banner off actual PENDING results (e.g., any(r.status == DependencyStatus.PENDING for r in checker.results)) and/or the summary returned by check_all() instead of the private _libstdcxx_pending field.
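
The recommendation above could be sketched as follows. The helper name `should_show_reactivate_banner` is hypothetical, and the result and status types are minimal stand-ins; whether the real checker exposes a `results` list is an assumption taken from the comment, not confirmed by the diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class DependencyStatus(Enum):
    OK = "ok"
    MISSING = "missing"
    PENDING = "pending"


@dataclass
class DependencyResult:
    name: str
    status: DependencyStatus


def should_show_reactivate_banner(results: Iterable[DependencyResult]) -> bool:
    """Show the banner only if at least one check was actually downgraded."""
    return any(r.status == DependencyStatus.PENDING for r in results)
```

With this shape, an environment where the flag is set but every import succeeded prints no banner, since no result carries PENDING.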

Comment thread src/kintsugi/deps.py
            return  # conda lib already on LD_LIBRARY_PATH — no cascade risk
        conda_libstdcxx = Path(conda_prefix) / "lib" / "libstdc++.so.6"
        if not conda_libstdcxx.exists():
            return  # no newer libstdc++ to shadow the system one

Copilot AI Apr 16, 2026

_detect_libstdcxx_pending() currently sets _libstdcxx_pending=True whenever a conda env is active, $CONDA_PREFIX/lib is not on LD_LIBRARY_PATH, and $CONDA_PREFIX/lib/libstdc++.so.6 exists. That combination is common even on non-HPC systems and doesn’t necessarily mean activation scripts were just deployed or that a transient mismatch exists. Consider tightening the predicate (e.g., require the deployed activation script file to exist under $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh, and/or confirm a version-mismatch condition) so PENDING downgrades only happen in the intended transient post-deploy case.

Suggested change
-            return  # no newer libstdc++ to shadow the system one
+            return  # no newer libstdc++ to shadow the system one
+        activate_script = (
+            Path(conda_prefix) / "etc" / "conda" / "activate.d" / "env_vars.sh"
+        )
+        if not activate_script.is_file():
+            return  # no deployed activation hook; avoid broad false-positive PENDING
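
For readers who want to see the whole tightened predicate in one place, here is a standalone sketch under stated assumptions. The function name `libstdcxx_pending` and the `env` parameter (added so the logic can be exercised without touching the real environment) are hypothetical; the three existing conditions and the `env_vars.sh` path are taken from the comment and suggestion above, and the real method sets `self._libstdcxx_pending` rather than returning a value.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def libstdcxx_pending(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only in the transient post-deploy case described above."""
    env = os.environ if env is None else env

    conda_prefix = env.get("CONDA_PREFIX")
    if not conda_prefix:
        return False  # no conda env active

    conda_lib = Path(conda_prefix) / "lib"
    ld_paths = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if any(Path(p) == conda_lib for p in ld_paths):
        return False  # conda lib already on LD_LIBRARY_PATH

    if not (conda_lib / "libstdc++.so.6").exists():
        return False  # no conda libstdc++ to shadow the system one

    # Extra condition from the suggestion: the activation hook must be deployed.
    activate_script = Path(conda_prefix) / "etc" / "conda" / "activate.d" / "env_vars.sh"
    return activate_script.is_file()
```

Passing an explicit mapping, for example `libstdcxx_pending({"CONDA_PREFIX": "/opt/conda/envs/KINTSUGI"})` with an illustrative path, makes each branch straightforward to cover in a unit test.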

@smith6jt-cop smith6jt-cop merged commit bbecd84 into main Apr 16, 2026
10 checks passed
@smith6jt-cop smith6jt-cop deleted the claude/fix-dependency-versions-AglwB branch April 16, 2026 20:55

3 participants