Skip to content

fix(ci): prevent CUDA hang and add test timeouts#76

Merged
royerloic merged 5 commits into
mainfrom
fix/ci-timeout-cuda-hang
Feb 13, 2026
Merged

fix(ci): prevent CUDA hang and add test timeouts#76
royerloic merged 5 commits into
mainfrom
fix/ci-timeout-cuda-hang

Conversation

@royerloic

@royerloic royerloic commented Feb 13, 2026

Copy link
Copy Markdown
Member

Summary

  • Set CUDA_VISIBLE_DEVICES="" in CI env to prevent TensorFlow/PyTorch from trying to initialize CUDA on CPU-only runners (CUDA pip packages are present but no GPU driver, causing an indefinite hang during test collection)
  • Add timeout-minutes: 30 to the CI job to prevent 6-hour runs (GitHub Actions default)
  • Add --timeout=300 (5 min per test) via pytest-timeout to kill individual hanging tests
  • Add pytest-timeout to testing dependencies
  • Mark 3 network-dependent tests as @pytest.mark.integration so they are excluded from CI: test_web_search_tool, test_wikipedia_search_tool, test_add_image_cell

Root cause

All recent CI runs were "cancelled" after ~6 hours. The test command pytest src/ -v -k "not integration and not llm and not slow" collects tests including cellpose_test.py and stardist_test.py, whose @pytest.mark.skipif(not check_*_installed(), ...) decorators import cellpose (→ PyTorch → CUDA) and stardist (→ csbdeep → TensorFlow → CUDA) at collection time. On GitHub Actions runners, the CUDA pip packages (nvidia-cuda-nvcc-cu12, etc.) are installed but no GPU driver exists, causing TensorFlow's CUDA initialization to hang indefinitely.

Test plan

  • All 233 tests pass locally with CUDA_VISIBLE_DEVICES="" and --timeout=300 in 3m28s
  • CI completes within 30 minutes after merge

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

    • Marked several tests as integration to improve organization.
    • Added CPU-only runtime controls and stricter job/test timeouts to reduce CI hangs.
  • Chores

    • Added pytest-timeout to test dependencies.
  • Bug Fixes / Reliability

    • Avoid heavy imports when checking optional imaging packages to reduce CI/workstation overhead.
    • Make segmentation routines tolerant across scikit-image versions to prevent failures.
    • Skip interactive API-key dialogs in non-interactive environments and require an API key before LLM initialization.

TensorFlow/PyTorch CUDA initialization hangs indefinitely on GitHub
Actions CPU-only runners because CUDA pip packages are present but
no GPU driver exists. Set CUDA_VISIBLE_DEVICES="" to disable GPU
detection, add 30-minute job timeout, and 5-minute per-test timeout
via pytest-timeout. Also mark network-dependent tests as integration
so they are excluded from CI runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Feb 13, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds CI job-level timeout and CPU-only env vars, introduces pytest timeout in CI and test deps, marks several tests as integration tests, replaces direct-import checks with importlib.util.find_spec for package detection, and adds scikit-image-version-aware calls to remove_small_objects.

Changes

Cohort / File(s) Summary
CI Workflow
\.github/workflows/test.yml
Adds 30-minute job timeout; sets CPU-only env vars (CUDA_VISIBLE_DEVICES="-1", TORCH_CUDA_ARCH_LIST="", USE_CUDA="0"); preserves existing test command but adds --timeout=300 to pytest and keeps verbose/test selection.
Test Dependencies
pyproject.toml
Adds pytest-timeout to optional test deps and to tool.hatch.envs.hatch-test extra-dependencies (reorders test-related entries).
Integration Test Markers
src/napari_chatgpt/omega_agent/tools/tests/web_search_tool_tests.py, src/napari_chatgpt/omega_agent/tools/tests/wikipedia_search_tool_tests.py, src/napari_chatgpt/utils/notebook/test/jupyter_notebook_test.py
Adds @pytest.mark.integration to three tests and imports pytest where needed; no logic changes.
Package-detection refactor & tests
src/napari_chatgpt/omega_agent/tools/napari/delegated_code/utils.py, src/napari_chatgpt/omega_agent/tools/napari/delegated_code/test/utils_test.py
Replace try/except import checks with importlib.util.find_spec(...) to detect stardist/cellpose without importing; update tests and module docstring accordingly.
API key / interactive guard
src/napari_chatgpt/llm/api_keys/api_key.py
Add private _is_interactive() helper; skip GUI dialog in non-interactive/CI environments and adjust messaging/control flow in set_api_key().
LLM availability pre-check
src/napari_chatgpt/llm/litemind_api.py
is_llm_available now requires at least one API key (OpenAI/Anthropic/Gemini) before proceeding; adds local import/check for keys to avoid unnecessary initialization.
scikit-image compatibility (core code & tests)
src/napari_chatgpt/omega_agent/tools/napari/delegated_code/classic.py, src/napari_chatgpt/utils/napari/test/napari_viewer_info_test.py, src/napari_chatgpt/utils/segmentation/remove_small_segments.py
Introduce version-aware calls to remove_small_objects: try max_size=... (newer scikit-image) and fall back to min_size=... on TypeError to preserve behavior across scikit-image versions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibbled CI carrots, calm and spry,

Timeouts trimmed the thicket, tests hop by,
CUDA tucked in for a quiet nap,
mark.integration gave each test a map,
find_spec peeks — imports take a nap. 🎉

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main changes: preventing CUDA hangs and adding test timeouts to the CI workflow, which are the primary objectives.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into main

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/ci-timeout-cuda-hang

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

royerloic and others added 2 commits February 12, 2026 22:49
Replace `import stardist` / `import cellpose` in check_*_installed()
with `importlib.util.find_spec()` which detects the package without
loading it. This prevents TensorFlow/PyTorch from being imported at
pytest collection time, which hangs on CI runners without GPU drivers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add import verification and collect-only diagnostic steps with their
own timeouts to pinpoint where the hang occurs. Use CUDA_VISIBLE_DEVICES=-1
(explicit no-GPU) instead of empty string, plus TORCH_CUDA_ARCH_LIST
and USE_CUDA=0 to further prevent CUDA initialization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 64-66: The workflow step "Collect tests (diagnose hangs)"
currently pipes pytest into tail which masks pytest's exit code; update the
step's run command to enable pipefail (e.g., run under bash with set -o pipefail
or use bash -lc "set -o pipefail && ...") so that the exit code from pytest (the
command that may fail during collection) is propagated instead of being
swallowed by tail; retain the diagnostic behavior (xvfb-run, pytest args, and
tail -20) but ensure pipefail is set so failures in pytest cause the step to
fail.

Comment thread .github/workflows/test.yml Outdated
Comment on lines +64 to +66
- name: Collect tests (diagnose hangs)
timeout-minutes: 5
run: xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Piping through tail swallows the pytest exit code.

If pytest --co fails (e.g., an import error during collection), the step will still pass because tail -20 exits 0. Since this is a diagnostic step meant to surface hangs, a silent failure here would be misleading.

Consider using pipefail to propagate the exit code:

Proposed fix
-        run: xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20
+        run: |
+          set -o pipefail
+          xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Collect tests (diagnose hangs)
timeout-minutes: 5
run: xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20
- name: Collect tests (diagnose hangs)
timeout-minutes: 5
run: |
set -o pipefail
xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20
🤖 Prompt for AI Agents
In @.github/workflows/test.yml around lines 64 - 66, The workflow step "Collect
tests (diagnose hangs)" currently pipes pytest into tail which masks pytest's
exit code; update the step's run command to enable pipefail (e.g., run under
bash with set -o pipefail or use bash -lc "set -o pipefail && ...") so that the
exit code from pytest (the command that may fail during collection) is
propagated instead of being swallowed by tail; retain the diagnostic behavior
(xvfb-run, pytest args, and tail -20) but ensure pipefail is set so failures in
pytest cause the step to fail.

royerloic and others added 2 commits February 13, 2026 06:33
Root cause: `is_llm_available()` (used in 10+ `@pytest.mark.skipif`
decorators) called `get_litemind_api()` → `set_api_key()` which
showed a blocking Qt dialog when API keys were missing. On CI this
dialog hanged indefinitely, preventing test collection.

- Add `_is_interactive()` guard to `set_api_key()` that detects
  pytest and CI environments and skips the dialog
- Add early API key check in `is_llm_available()` to avoid expensive
  initialisation when no keys are set
- Remove diagnostic CI steps no longer needed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scikit-image >= 0.24 added `max_size` parameter, and >= 0.26
deprecated `min_size`. Use try/except to support both old and new
versions (Python 3.10 gets older scikit-image, Python 3.12 gets newer).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 22-29: Replace the non-standard CUDA_VISIBLE_DEVICES value "-1"
with the CUDA-documented standard empty string "" in the GitHub Actions
environment block; update the env entry for CUDA_VISIBLE_DEVICES so it is set to
an empty string (CUDA_VISIBLE_DEVICES: "") to follow CUDA conventions and avoid
potential device-detection confusion (leave TORCH_CUDA_ARCH_LIST and USE_CUDA
unchanged).

In `@src/napari_chatgpt/utils/napari/test/napari_viewer_info_test.py`:
- Around line 33-38: The test currently mismatches scikit-image versions and
uses opposite semantics (max_size vs min_size); update the code to explicitly
detect the skimage version (skimage.__version__) and ensure consistent "remove
objects smaller than N" behavior: for versions that support min_size call
remove_small_objects(clear_border(bw), min_size=20), and for versions where
min_size was removed (>=0.26) emulate the same semantics by labeling
(skimage.measure.label) and removing small components via np.bincount to filter
out labels smaller than 20 so the test result is identical across versions;
reference remove_small_objects, clear_border, label and the bw test image when
making the change.
🧹 Nitpick comments (2)
src/napari_chatgpt/omega_agent/tools/napari/delegated_code/classic.py (1)

174-183: Third copy of the version-compat block — use remove_small_segments instead.

This file already has the min_segment_size parameter. The utility function remove_small_segments() in src/napari_chatgpt/utils/segmentation/remove_small_segments.py encapsulates this exact logic. Replacing this block with a call to it would eliminate the duplication.

Suggested refactor
+    from napari_chatgpt.utils.segmentation.remove_small_segments import remove_small_segments
+
     # Remove small objects:
-    try:
-        # scikit-image >= 0.24 (min_size deprecated in 0.26)
-        labels = remove_small_objects(
-            labels, max_size=min_segment_size - 1
-        )
-    except TypeError:
-        # scikit-image < 0.24
-        labels = remove_small_objects(
-            labels, min_size=min_segment_size
-        )
+    labels = remove_small_segments(labels, min_segment_size)

You can also remove remove_small_objects from the imports at the top of the file if no other call site uses it.

src/napari_chatgpt/llm/litemind_api.py (1)

56-63: Good early-exit to avoid expensive initialization and potential Qt dialog.

Minor note: the provider list ["OpenAI", "Anthropic", "Gemini"] is duplicated from api_key.py's __api_key_names dict. If providers are added later, both locations need updating. Consider exposing the key names from api_key.py (e.g., a public get_supported_providers() function) to keep them in sync.

Comment on lines +22 to +29
env:
# Prevent TensorFlow/PyTorch from trying to initialise CUDA on
# CPU-only runners (CUDA pip packages are present but no GPU
# driver exists, which causes an indefinite hang).
CUDA_VISIBLE_DEVICES: "-1"
# Force CPU-only mode for PyTorch and TensorFlow
TORCH_CUDA_ARCH_LIST: ""
USE_CUDA: "0"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/test.yml | sed -n '20,35p'

Repository: royerlab/napari-chatgpt

Length of output: 703


🌐 Web query:

CUDA_VISIBLE_DEVICES empty string vs -1 behavior TensorFlow PyTorch

💡 Result:

CUDA semantics (applies to both TensorFlow and PyTorch)

  • CUDA_VISIBLE_DEVICES="" (empty string) ⇒ no GPUs are visible to CUDA. [1]
  • CUDA_VISIBLE_DEVICES="-1"-1 is an invalid device index; CUDA’s rule is “stop at the first invalid index,” so with -1 as the first entry, no GPUs are visible (effectively the same outcome as empty string). [1]

Why people observe different behavior in TensorFlow / PyTorch

In practice, “"" didn’t work but -1 did” is usually because the env var was set too late (after the framework or a dependency already initialized CUDA / queried devices):

  • PyTorch: set CUDA_VISIBLE_DEVICES before importing torch (or set it in the shell). PyTorch maintainers explicitly warn that setting it after imports may have no effect. [3]
  • TensorFlow: device visibility must be configured before GPUs are initialized (TF notes this for tf.config.set_visible_devices; the same timing principle applies to CUDA_VISIBLE_DEVICES). [2]

Practical recommendations

  • To disable GPUs reliably, prefer setting it in the shell:
    • CUDA_VISIBLE_DEVICES="" python your_script.py (hide all GPUs) [1]
    • CUDA_VISIBLE_DEVICES="-1" python your_script.py (also hides all GPUs via “invalid index first”) [1]
  • If setting inside Python, do it at the very top, before importing TensorFlow/PyTorch (and ideally before any library that might touch CUDA). [2][3]

Related gotcha: reindexing

If you set CUDA_VISIBLE_DEVICES=2, the one visible physical GPU is renumbered as cuda:0 / /GPU:0 inside the process. [1]

Sources: NVIDIA CUDA Programming Guide on CUDA_VISIBLE_DEVICES [1]; TensorFlow GPU guide (initialization timing) [2]; PyTorch forum guidance on setting env vars before imports [3].


🌐 Web query:

CUDA_VISIBLE_DEVICES -1 invalid ordinal warnings

💡 Result:

CUDA_VISIBLE_DEVICES=-1 makes CUDA see no valid GPUs, because -1 is an invalid GPU index; when CUDA hits an invalid index in the list, it exposes only devices that appear before it—so with -1 first, that means zero visible devices. Any later attempt by your framework/code to use cuda:0, cuda:1, etc. then triggers “invalid device ordinal” warnings/errors. [1]

What to do instead:

  • To disable all GPUs (recommended): set an empty string
    • bash: export CUDA_VISIBLE_DEVICES="" [1]
    • (this is the CUDA-documented way to hide all GPUs) [1]
  • If you want GPU N, and only that GPU:
    • bash: export CUDA_VISIBLE_DEVICES=N
    • then inside the program, use cuda:0, because CUDA renumbers visible devices starting at 0. [1]
  • PyTorch-specific gotcha: if you run with CUDA_VISIBLE_DEVICES=1, your script must use cuda:0 (not cuda:1), or you’ll get invalid device ordinal. [2]

Also ensure you set CUDA_VISIBLE_DEVICES before importing GPU-using libraries (PyTorch/TensorFlow), since they may initialize CUDA on import. [2]

Sources:
[1] NVIDIA CUDA Programming Guide — CUDA_VISIBLE_DEVICES behavior and empty-string disabling GPUs
[2] PyTorch Forums — explanation of device renumbering after masking with CUDA_VISIBLE_DEVICES


Change CUDA_VISIBLE_DEVICES to empty string for standard CUDA practice — while "-1" effectively hides all GPUs (by being an invalid device index that CUDA stops at), the CUDA-documented standard approach is to use an empty string (""). Using "" aligns with CUDA conventions and avoids potential confusion or issues if code later attempts to access CUDA devices.

Suggested fix
-      CUDA_VISIBLE_DEVICES: "-1"
+      CUDA_VISIBLE_DEVICES: ""
🤖 Prompt for AI Agents
In @.github/workflows/test.yml around lines 22 - 29, Replace the non-standard
CUDA_VISIBLE_DEVICES value "-1" with the CUDA-documented standard empty string
"" in the GitHub Actions environment block; update the env entry for
CUDA_VISIBLE_DEVICES so it is set to an empty string (CUDA_VISIBLE_DEVICES: "")
to follow CUDA conventions and avoid potential device-detection confusion (leave
TORCH_CUDA_ARCH_LIST and USE_CUDA unchanged).

Comment on lines +33 to +38
try:
# scikit-image >= 0.24 (min_size deprecated in 0.26)
cleared = remove_small_objects(clear_border(bw), max_size=19)
except TypeError:
# scikit-image < 0.24
cleared = remove_small_objects(clear_border(bw), min_size=20)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

scikit-image remove_small_objects max_size parameter changelog 0.24

💡 Result:

In scikit-image 0.24.0 (2024-06-18), skimage.morphology.remove_small_objects does not have a max_size parameter—its signature is:

  • remove_small_objects(ar, min_size=64, connectivity=1, *, out=None) and it “removes objects smaller than min_size”. [1]

The max_size keyword (and the corresponding deprecation of min_size) was introduced later—“Added in version 0.26” (per the current/stable docs), i.e. not in the 0.24 changelog/release notes. [2]

Sources:
[1] scikit-image 0.24.0 API docs (remove_small_objects signature shows min_size) https://scikit-image.org/docs/0.24.x/api/skimage.morphology.html
[2] scikit-image stable API docs (remove_small_objects shows max_size, “Added in version 0.26”, min_size deprecated) https://scikit-image.org/docs/stable/api/skimage.morphology


The comment is misleading about the version timeline, and the code has a semantic issue with parameter usage.

The max_size parameter was introduced in scikit-image 0.26, not 0.24. In versions 0.24 and 0.25, the try block will always raise TypeError and silently fall back to the except clause—the code works only by accident.

Additionally, max_size and min_size have opposite semantics: min_size removes objects smaller than the threshold, while max_size removes objects larger than the threshold. Using max_size=19 versus min_size=20 produces functionally different results, suggesting a logic error. The comment should clarify the version constraint and the semantic difference should be addressed.

🤖 Prompt for AI Agents
In `@src/napari_chatgpt/utils/napari/test/napari_viewer_info_test.py` around lines
33 - 38, The test currently mismatches scikit-image versions and uses opposite
semantics (max_size vs min_size); update the code to explicitly detect the
skimage version (skimage.__version__) and ensure consistent "remove objects
smaller than N" behavior: for versions that support min_size call
remove_small_objects(clear_border(bw), min_size=20), and for versions where
min_size was removed (>=0.26) emulate the same semantics by labeling
(skimage.measure.label) and removing small components via np.bincount to filter
out labels smaller than 20 so the test result is identical across versions;
reference remove_small_objects, clear_border, label and the bw test image when
making the change.

@royerloic royerloic merged commit a14d738 into main Feb 13, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant