fix(ci): prevent CUDA hang and add test timeouts#76
Conversation
TensorFlow/PyTorch CUDA initialization hangs indefinitely on GitHub Actions CPU-only runners because CUDA pip packages are present but no GPU driver exists. Set CUDA_VISIBLE_DEVICES="" to disable GPU detection, add 30-minute job timeout, and 5-minute per-test timeout via pytest-timeout. Also mark network-dependent tests as integration so they are excluded from CI runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WalkthroughAdds CI job-level timeout and CPU-only env vars, introduces pytest timeout in CI and test deps, marks several tests as integration tests, replaces direct-import checks with importlib.util.find_spec for package detection, and adds scikit-image-version-aware calls to remove_small_objects. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Replace `import stardist` / `import cellpose` in check_*_installed() with `importlib.util.find_spec()` which detects the package without loading it. This prevents TensorFlow/PyTorch from being imported at pytest collection time, which hangs on CI runners without GPU drivers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add import verification and collect-only diagnostic steps with their own timeouts to pinpoint where the hang occurs. Use CUDA_VISIBLE_DEVICES=-1 (explicit no-GPU) instead of empty string, plus TORCH_CUDA_ARCH_LIST and USE_CUDA=0 to further prevent CUDA initialization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 64-66: The workflow step "Collect tests (diagnose hangs)"
currently pipes pytest into tail which masks pytest's exit code; update the
step's run command to enable pipefail (e.g., run under bash with set -o pipefail
or use bash -lc "set -o pipefail && ...") so that the exit code from pytest (the
command that may fail during collection) is propagated instead of being
swallowed by tail; retain the diagnostic behavior (xvfb-run, pytest args, and
tail -20) but ensure pipefail is set so failures in pytest cause the step to
fail.
| - name: Collect tests (diagnose hangs) | ||
| timeout-minutes: 5 | ||
| run: xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20 |
There was a problem hiding this comment.
Piping through tail swallows the pytest exit code.
If pytest --co fails (e.g., an import error during collection), the step will still pass because tail -20 exits 0. Since this is a diagnostic step meant to surface hangs, a silent failure here would be misleading.
Consider using pipefail to propagate the exit code:
Proposed fix
- run: xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20
+ run: |
+ set -o pipefail
+ xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: Collect tests (diagnose hangs) | |
| timeout-minutes: 5 | |
| run: xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20 | |
| - name: Collect tests (diagnose hangs) | |
| timeout-minutes: 5 | |
| run: | | |
| set -o pipefail | |
| xvfb-run pytest src/ --co --timeout=120 -k "not integration and not llm and not slow" 2>&1 | tail -20 |
🤖 Prompt for AI Agents
In @.github/workflows/test.yml around lines 64 - 66, The workflow step "Collect
tests (diagnose hangs)" currently pipes pytest into tail which masks pytest's
exit code; update the step's run command to enable pipefail (e.g., run under
bash with set -o pipefail or use bash -lc "set -o pipefail && ...") so that the
exit code from pytest (the command that may fail during collection) is
propagated instead of being swallowed by tail; retain the diagnostic behavior
(xvfb-run, pytest args, and tail -20) but ensure pipefail is set so failures in
pytest cause the step to fail.
Root cause: `is_llm_available()` (used in 10+ `@pytest.mark.skipif` decorators) called `get_litemind_api()` → `set_api_key()` which showed a blocking Qt dialog when API keys were missing. On CI this dialog hanged indefinitely, preventing test collection. - Add `_is_interactive()` guard to `set_api_key()` that detects pytest and CI environments and skips the dialog - Add early API key check in `is_llm_available()` to avoid expensive initialisation when no keys are set - Remove diagnostic CI steps no longer needed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scikit-image >= 0.24 added `max_size` parameter, and >= 0.26 deprecated `min_size`. Use try/except to support both old and new versions (Python 3.10 gets older scikit-image, Python 3.12 gets newer). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 22-29: Replace the non-standard CUDA_VISIBLE_DEVICES value "-1"
with the CUDA-documented standard empty string "" in the GitHub Actions
environment block; update the env entry for CUDA_VISIBLE_DEVICES so it is set to
an empty string (CUDA_VISIBLE_DEVICES: "") to follow CUDA conventions and avoid
potential device-detection confusion (leave TORCH_CUDA_ARCH_LIST and USE_CUDA
unchanged).
In `@src/napari_chatgpt/utils/napari/test/napari_viewer_info_test.py`:
- Around line 33-38: The test currently mismatches scikit-image versions and
uses opposite semantics (max_size vs min_size); update the code to explicitly
detect the skimage version (skimage.__version__) and ensure consistent "remove
objects smaller than N" behavior: for versions that support min_size call
remove_small_objects(clear_border(bw), min_size=20), and for versions where
min_size was removed (>=0.26) emulate the same semantics by labeling
(skimage.measure.label) and removing small components via np.bincount to filter
out labels smaller than 20 so the test result is identical across versions;
reference remove_small_objects, clear_border, label and the bw test image when
making the change.
🧹 Nitpick comments (2)
src/napari_chatgpt/omega_agent/tools/napari/delegated_code/classic.py (1)
174-183: Third copy of the version-compat block — useremove_small_segmentsinstead.This file already has the
min_segment_sizeparameter. The utility functionremove_small_segments()insrc/napari_chatgpt/utils/segmentation/remove_small_segments.pyencapsulates this exact logic. Replacing this block with a call to it would eliminate the duplication.Suggested refactor
+ from napari_chatgpt.utils.segmentation.remove_small_segments import remove_small_segments + # Remove small objects: - try: - # scikit-image >= 0.24 (min_size deprecated in 0.26) - labels = remove_small_objects( - labels, max_size=min_segment_size - 1 - ) - except TypeError: - # scikit-image < 0.24 - labels = remove_small_objects( - labels, min_size=min_segment_size - ) + labels = remove_small_segments(labels, min_segment_size)You can also remove
remove_small_objectsfrom the imports at the top of the file if no other call site uses it.src/napari_chatgpt/llm/litemind_api.py (1)
56-63: Good early-exit to avoid expensive initialization and potential Qt dialog.Minor note: the provider list
["OpenAI", "Anthropic", "Gemini"]is duplicated fromapi_key.py's__api_key_namesdict. If providers are added later, both locations need updating. Consider exposing the key names fromapi_key.py(e.g., a publicget_supported_providers()function) to keep them in sync.
| env: | ||
| # Prevent TensorFlow/PyTorch from trying to initialise CUDA on | ||
| # CPU-only runners (CUDA pip packages are present but no GPU | ||
| # driver exists, which causes an indefinite hang). | ||
| CUDA_VISIBLE_DEVICES: "-1" | ||
| # Force CPU-only mode for PyTorch and TensorFlow | ||
| TORCH_CUDA_ARCH_LIST: "" | ||
| USE_CUDA: "0" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n .github/workflows/test.yml | sed -n '20,35p'Repository: royerlab/napari-chatgpt
Length of output: 703
🌐 Web query:
CUDA_VISIBLE_DEVICES empty string vs -1 behavior TensorFlow PyTorch
💡 Result:
CUDA semantics (applies to both TensorFlow and PyTorch)
CUDA_VISIBLE_DEVICES=""(empty string) ⇒ no GPUs are visible to CUDA. [1]CUDA_VISIBLE_DEVICES="-1"⇒-1is an invalid device index; CUDA’s rule is “stop at the first invalid index,” so with-1as the first entry, no GPUs are visible (effectively the same outcome as empty string). [1]
Why people observe different behavior in TensorFlow / PyTorch
In practice, “"" didn’t work but -1 did” is usually because the env var was set too late (after the framework or a dependency already initialized CUDA / queried devices):
- PyTorch: set
CUDA_VISIBLE_DEVICESbefore importingtorch(or set it in the shell). PyTorch maintainers explicitly warn that setting it after imports may have no effect. [3] - TensorFlow: device visibility must be configured before GPUs are initialized (TF notes this for
tf.config.set_visible_devices; the same timing principle applies toCUDA_VISIBLE_DEVICES). [2]
Practical recommendations
- To disable GPUs reliably, prefer setting it in the shell:
CUDA_VISIBLE_DEVICES="" python your_script.py(hide all GPUs) [1]CUDA_VISIBLE_DEVICES="-1" python your_script.py(also hides all GPUs via “invalid index first”) [1]
- If setting inside Python, do it at the very top, before importing TensorFlow/PyTorch (and ideally before any library that might touch CUDA). [2][3]
Related gotcha: reindexing
If you set CUDA_VISIBLE_DEVICES=2, the one visible physical GPU is renumbered as cuda:0 / /GPU:0 inside the process. [1]
Sources: NVIDIA CUDA Programming Guide on CUDA_VISIBLE_DEVICES [1]; TensorFlow GPU guide (initialization timing) [2]; PyTorch forum guidance on setting env vars before imports [3].
🌐 Web query:
CUDA_VISIBLE_DEVICES -1 invalid ordinal warnings
💡 Result:
CUDA_VISIBLE_DEVICES=-1 makes CUDA see no valid GPUs, because -1 is an invalid GPU index; when CUDA hits an invalid index in the list, it exposes only devices that appear before it—so with -1 first, that means zero visible devices. Any later attempt by your framework/code to use cuda:0, cuda:1, etc. then triggers “invalid device ordinal” warnings/errors. [1]
What to do instead:
- To disable all GPUs (recommended): set an empty string
- bash:
export CUDA_VISIBLE_DEVICES=""[1] - (this is the CUDA-documented way to hide all GPUs) [1]
- bash:
- If you want GPU N, and only that GPU:
- bash:
export CUDA_VISIBLE_DEVICES=N - then inside the program, use
cuda:0, because CUDA renumbers visible devices starting at 0. [1]
- bash:
- PyTorch-specific gotcha: if you run with
CUDA_VISIBLE_DEVICES=1, your script must usecuda:0(notcuda:1), or you’ll getinvalid device ordinal. [2]
Also ensure you set CUDA_VISIBLE_DEVICES before importing GPU-using libraries (PyTorch/TensorFlow), since they may initialize CUDA on import. [2]
Sources:
[1] NVIDIA CUDA Programming Guide — CUDA_VISIBLE_DEVICES behavior and empty-string disabling GPUs
[2] PyTorch Forums — explanation of device renumbering after masking with CUDA_VISIBLE_DEVICES
Change CUDA_VISIBLE_DEVICES to empty string for standard CUDA practice — while "-1" effectively hides all GPUs (by being an invalid device index that CUDA stops at), the CUDA-documented standard approach is to use an empty string (""). Using "" aligns with CUDA conventions and avoids potential confusion or issues if code later attempts to access CUDA devices.
Suggested fix
- CUDA_VISIBLE_DEVICES: "-1"
+ CUDA_VISIBLE_DEVICES: ""🤖 Prompt for AI Agents
In @.github/workflows/test.yml around lines 22 - 29, Replace the non-standard
CUDA_VISIBLE_DEVICES value "-1" with the CUDA-documented standard empty string
"" in the GitHub Actions environment block; update the env entry for
CUDA_VISIBLE_DEVICES so it is set to an empty string (CUDA_VISIBLE_DEVICES: "")
to follow CUDA conventions and avoid potential device-detection confusion (leave
TORCH_CUDA_ARCH_LIST and USE_CUDA unchanged).
| try: | ||
| # scikit-image >= 0.24 (min_size deprecated in 0.26) | ||
| cleared = remove_small_objects(clear_border(bw), max_size=19) | ||
| except TypeError: | ||
| # scikit-image < 0.24 | ||
| cleared = remove_small_objects(clear_border(bw), min_size=20) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
scikit-image remove_small_objects max_size parameter changelog 0.24
💡 Result:
In scikit-image 0.24.0 (2024-06-18), skimage.morphology.remove_small_objects does not have a max_size parameter—its signature is:
remove_small_objects(ar, min_size=64, connectivity=1, *, out=None)and it “removes objects smaller thanmin_size”. [1]
The max_size keyword (and the corresponding deprecation of min_size) was introduced later—“Added in version 0.26” (per the current/stable docs), i.e. not in the 0.24 changelog/release notes. [2]
Sources:
[1] scikit-image 0.24.0 API docs (remove_small_objects signature shows min_size) https://scikit-image.org/docs/0.24.x/api/skimage.morphology.html
[2] scikit-image stable API docs (remove_small_objects shows max_size, “Added in version 0.26”, min_size deprecated) https://scikit-image.org/docs/stable/api/skimage.morphology
The comment is misleading about the version timeline, and the code has a semantic issue with parameter usage.
The max_size parameter was introduced in scikit-image 0.26, not 0.24. In versions 0.24 and 0.25, the try block will always raise TypeError and silently fall back to the except clause—the code works only by accident.
Additionally, max_size and min_size have opposite semantics: min_size removes objects smaller than the threshold, while max_size removes objects larger than the threshold. Using max_size=19 versus min_size=20 produces functionally different results, suggesting a logic error. The comment should clarify the version constraint and the semantic difference should be addressed.
🤖 Prompt for AI Agents
In `@src/napari_chatgpt/utils/napari/test/napari_viewer_info_test.py` around lines
33 - 38, The test currently mismatches scikit-image versions and uses opposite
semantics (max_size vs min_size); update the code to explicitly detect the
skimage version (skimage.__version__) and ensure consistent "remove objects
smaller than N" behavior: for versions that support min_size call
remove_small_objects(clear_border(bw), min_size=20), and for versions where
min_size was removed (>=0.26) emulate the same semantics by labeling
(skimage.measure.label) and removing small components via np.bincount to filter
out labels smaller than 20 so the test result is identical across versions;
reference remove_small_objects, clear_border, label and the bw test image when
making the change.
Summary
CUDA_VISIBLE_DEVICES=""in CI env to prevent TensorFlow/PyTorch from trying to initialize CUDA on CPU-only runners (CUDA pip packages are present but no GPU driver, causing an indefinite hang during test collection)timeout-minutes: 30to the CI job to prevent 6-hour runs (GitHub Actions default)--timeout=300(5 min per test) viapytest-timeoutto kill individual hanging testspytest-timeoutto testing dependencies@pytest.mark.integrationso they are excluded from CI:test_web_search_tool,test_wikipedia_search_tool,test_add_image_cellRoot cause
All recent CI runs were "cancelled" after ~6 hours. The test command
pytest src/ -v -k "not integration and not llm and not slow"collects tests includingcellpose_test.pyandstardist_test.py, whose@pytest.mark.skipif(not check_*_installed(), ...)decorators importcellpose(→ PyTorch → CUDA) andstardist(→ csbdeep → TensorFlow → CUDA) at collection time. On GitHub Actions runners, the CUDA pip packages (nvidia-cuda-nvcc-cu12, etc.) are installed but no GPU driver exists, causing TensorFlow's CUDA initialization to hang indefinitely.Test plan
CUDA_VISIBLE_DEVICES=""and--timeout=300in 3m28s🤖 Generated with Claude Code
Summary by CodeRabbit
Tests
Chores
Bug Fixes / Reliability