test/ci: follow up on hosted Ollama tool-calling timeouts by fl0rianr · Pull Request #2399 · lemonade-sdk/lemonade

fl0rianr · 2026-06-24T12:41:55Z

Summary

Follow-up to #2365.

The previous CI hardening added reset/diagnostic handling between shared-server test suites. New CI logs show that the reset itself is working, but a heavier Ollama native tool-calling path can still wedge hosted Windows/macOS runners: the request times out after 500s and can leave the Qwen tool-calling backend reported as "busy" / "in_use". That then causes misleading follow-on failures in later streaming-error tests.

This PR narrows that failure mode without removing the broader CI coverage.

Changes

Gate test_026_anthropic_messages_tool_calling on GitHub-hosted Windows/macOS.
- Linux keeps the heavy native Qwen tool-calling coverage.
- Windows/macOS can still opt in with LEMONADE_RUN_HEAVY_OLLAMA_TOOL_TESTS=1.
- The lighter Ollama tool-streaming coverage remains active.
Make reset_server_state.py verify the unload postcondition via /api/ps.
- /api/v1/unload returning is no longer treated as sufficient by itself.
- The helper checks that no models remain running before reporting reset completion.
Make server_streaming_errors.py::test_003 skip when its unload precondition is not met.
- This avoids a secondary timeout when a previous test has already left the server busy.
- The primary failure remains attributed to the test that wedged the backend.

Rationale

The failing path is not a normal lightweight API compatibility check. It uses the native tool-calling model Qwen3-4B-Instruct-2507-GGUF, which is too heavy and unstable for GitHub-hosted Windows/macOS CI in this shared-server workflow.

Skipping only this heavy path on hosted Windows/macOS is a CI-stability tradeoff, not a broad coverage removal:

Linux still exercises the heavy native tool-calling behavior.
Windows/macOS still run the rest of the Ollama API suite.
The test can be explicitly re-enabled with an env var.
The reset helper now reports the real postcondition instead of giving false confidence.

kenvandine · 2026-06-24T14:23:27Z

@claude please review this

github-actions · 2026-06-24T14:23:43Z

Claude finished @kenvandine's task in 2m 8s —— View job

Review: PR #2399 — follow-up on hosted Ollama tool-calling timeouts

I read all three changed files plus the surrounding context (reset_server_state.py, server_streaming_errors.py, test_ollama.py, the /api/ps handler, and the CI workflow invocations). This is a well-scoped, well-reasoned CI-stability change. Below is my feedback.

✅ What's good

Narrow, reversible gating. _skip_slow_hosted_ci_tool_calling() only skips the one heavy native Qwen path on hosted Windows/macOS (GITHUB_ACTIONS=true + sys.platform in (darwin, win32)), keeps full coverage on Linux, and provides an explicit LEMONADE_RUN_HEAVY_OLLAMA_TOOL_TESTS=1 escape hatch. The lighter Ollama tool-streaming coverage stays active. This matches the stated tradeoff and doesn't silently erase coverage.
Real postcondition verification. Polling /api/ps until models == [] is the right fix — /api/v1/unload returning 200 isn't proof the backend actually drained. I confirmed handle_ps reflects router_->get_all_loaded_models(), so an empty list is a meaningful signal.
Safe blast radius for the strict reset. I checked every workflow call site — all 21 invocations of reset_server_state use --best-effort, and reset_server_state() is only ever invoked through main() (the server_cli2.py matches were comments/other cleanup). So the new RuntimeError on a wedged backend is caught, prints diagnostics, and returns 0. It adds visibility, not new hard CI failures. Good.
Skip-not-fail on unmet precondition in test_003 correctly keeps the blame on the test that actually wedged the backend rather than producing a misleading secondary timeout.

🟡 Minor nits (non-blocking)

Per-request timeout exceeds the wait budget in the streaming helper. In server_streaming_errors.py, _running_models() uses timeout=TIMEOUT_DEFAULT (= 60s), but _wait_for_no_running_models(timeout=15) is meant to be a ~15s budget. A single hung GET /api/ps can therefore block up to 60s, blowing past the 15s deadline (the while time.time() < deadline check only runs between requests). The reset helper got this right (per-request timeout=5 under a 60s window). Suggest dropping the per-request timeout to ~5s here for consistency. Fix this →
Duplicated helpers. _running_models / _wait_for_no_running_models now exist in both reset_server_state.py (raises) and server_streaming_errors.py (returns bool). The semantics differ (raise vs. skip), so I wouldn't force a merge, but the /api/ps polling logic could live as a shared util that each caller wraps. Optional.
Cosmetic string reflow. Splitting "Run the calculator_calculate tool with expression set to 1+1" into a parenthesized concatenation is an unrelated reformat bundled into the diff. Harmless, just noise — fine to leave.

🔎 One thing worth confirming

/api/ps reports loaded models (LRU-resident), not strictly busy ones. After an unload, that's the same thing, so the logic holds for the reset/precondition path. Just flagging that if a future caller reuses _wait_for_no_running_models without a preceding unload, "no running models" really means "nothing loaded at all" — which may be stricter than intended. Not an issue for this PR's usage.

Verdict

Looks good to me. The approach is sound, the blast radius is contained by --best-effort, and the coverage tradeoff is explicit and documented. Only nit #1 is a genuine (minor) correctness inconsistency; the rest are optional polish.
· branch fl0rianr/CI_cli_endpoint_testtimeout

github-actions Bot added area::ci CI / GitHub Actions / self-hosted runner infrastructure enhancement New feature or request labels Jun 24, 2026

test/ci: stabilize hosted Ollama tool-calling flakes

474f888

fl0rianr force-pushed the fl0rianr/CI_cli_endpoint_testtimeout branch from 1c9a1ab to 474f888 Compare June 24, 2026 13:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test/ci: follow up on hosted Ollama tool-calling timeouts#2399

test/ci: follow up on hosted Ollama tool-calling timeouts#2399
fl0rianr wants to merge 1 commit into
mainfrom
fl0rianr/CI_cli_endpoint_testtimeout

fl0rianr commented Jun 24, 2026

Uh oh!

kenvandine commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

fl0rianr commented Jun 24, 2026

Summary

Changes

Rationale

Uh oh!

kenvandine commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: PR #2399 — follow-up on hosted Ollama tool-calling timeouts

✅ What's good

🟡 Minor nits (non-blocking)

🔎 One thing worth confirming

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 24, 2026 •

edited

Loading