You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The previous CI hardening added reset/diagnostic handling between shared-server test suites. New CI logs show that the reset itself is working, but a heavier Ollama native tool-calling path can still wedge hosted Windows/macOS runners: the request times out after 500s and can leave the Qwen tool-calling backend reported as "busy" / "in_use". That then causes misleading follow-on failures in later streaming-error tests.
This PR narrows that failure mode without removing the broader CI coverage.
Changes
Gate test_026_anthropic_messages_tool_calling on GitHub-hosted Windows/macOS.
Linux keeps the heavy native Qwen tool-calling coverage.
Windows/macOS can still opt in with LEMONADE_RUN_HEAVY_OLLAMA_TOOL_TESTS=1.
The lighter Ollama tool-streaming coverage remains active.
Make reset_server_state.py verify the unload postcondition via /api/ps.
/api/v1/unload returning is no longer treated as sufficient by itself.
The helper checks that no models remain running before reporting reset completion.
Make server_streaming_errors.py::test_003 skip when its unload precondition is not met.
This avoids a secondary timeout when a previous test has already left the server busy.
The primary failure remains attributed to the test that wedged the backend.
Rationale
The failing path is not a normal lightweight API compatibility check. It uses the native tool-calling model Qwen3-4B-Instruct-2507-GGUF, which is too heavy and unstable for GitHub-hosted Windows/macOS CI in this shared-server workflow.
Skipping only this heavy path on hosted Windows/macOS is a CI-stability tradeoff, not a broad coverage removal:
Linux still exercises the heavy native tool-calling behavior.
Windows/macOS still run the rest of the Ollama API suite.
The test can be explicitly re-enabled with an env var.
The reset helper now reports the real postcondition instead of giving false confidence.
Review: PR #2399 — follow-up on hosted Ollama tool-calling timeouts
I read all three changed files plus the surrounding context (reset_server_state.py, server_streaming_errors.py, test_ollama.py, the /api/ps handler, and the CI workflow invocations). This is a well-scoped, well-reasoned CI-stability change. Below is my feedback.
✅ What's good
Narrow, reversible gating._skip_slow_hosted_ci_tool_calling() only skips the one heavy native Qwen path on hosted Windows/macOS (GITHUB_ACTIONS=true + sys.platform in (darwin, win32)), keeps full coverage on Linux, and provides an explicit LEMONADE_RUN_HEAVY_OLLAMA_TOOL_TESTS=1 escape hatch. The lighter Ollama tool-streaming coverage stays active. This matches the stated tradeoff and doesn't silently erase coverage.
Real postcondition verification. Polling /api/ps until models == [] is the right fix — /api/v1/unload returning 200 isn't proof the backend actually drained. I confirmed handle_ps reflects router_->get_all_loaded_models(), so an empty list is a meaningful signal.
Safe blast radius for the strict reset. I checked every workflow call site — all 21 invocations of reset_server_state use --best-effort, and reset_server_state() is only ever invoked through main() (the server_cli2.py matches were comments/other cleanup). So the new RuntimeError on a wedged backend is caught, prints diagnostics, and returns 0. It adds visibility, not new hard CI failures. Good.
Skip-not-fail on unmet precondition in test_003 correctly keeps the blame on the test that actually wedged the backend rather than producing a misleading secondary timeout.
🟡 Minor nits (non-blocking)
Per-request timeout exceeds the wait budget in the streaming helper. In server_streaming_errors.py, _running_models() uses timeout=TIMEOUT_DEFAULT (= 60s), but _wait_for_no_running_models(timeout=15) is meant to be a ~15s budget. A single hung GET /api/ps can therefore block up to 60s, blowing past the 15s deadline (the while time.time() < deadline check only runs between requests). The reset helper got this right (per-request timeout=5 under a 60s window). Suggest dropping the per-request timeout to ~5s here for consistency. Fix this →
Duplicated helpers._running_models / _wait_for_no_running_models now exist in both reset_server_state.py (raises) and server_streaming_errors.py (returns bool). The semantics differ (raise vs. skip), so I wouldn't force a merge, but the /api/ps polling logic could live as a shared util that each caller wraps. Optional.
Cosmetic string reflow. Splitting "Run the calculator_calculate tool with expression set to 1+1" into a parenthesized concatenation is an unrelated reformat bundled into the diff. Harmless, just noise — fine to leave.
🔎 One thing worth confirming
/api/ps reports loaded models (LRU-resident), not strictly busy ones. After an unload, that's the same thing, so the logic holds for the reset/precondition path. Just flagging that if a future caller reuses _wait_for_no_running_modelswithout a preceding unload, "no running models" really means "nothing loaded at all" — which may be stricter than intended. Not an issue for this PR's usage.
Verdict
Looks good to me. The approach is sound, the blast radius is contained by --best-effort, and the coverage tradeoff is explicit and documented. Only nit #1 is a genuine (minor) correctness inconsistency; the rest are optional polish.
· branch fl0rianr/CI_cli_endpoint_testtimeout
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #2365.
The previous CI hardening added reset/diagnostic handling between shared-server test suites. New CI logs show that the reset itself is working, but a heavier Ollama native tool-calling path can still wedge hosted Windows/macOS runners: the request times out after 500s and can leave the Qwen tool-calling backend reported as "busy" / "in_use". That then causes misleading follow-on failures in later streaming-error tests.
This PR narrows that failure mode without removing the broader CI coverage.
Changes
Gate test_026_anthropic_messages_tool_calling on GitHub-hosted Windows/macOS.
Make reset_server_state.py verify the unload postcondition via /api/ps.
Make server_streaming_errors.py::test_003 skip when its unload precondition is not met.
Rationale
The failing path is not a normal lightweight API compatibility check. It uses the native tool-calling model Qwen3-4B-Instruct-2507-GGUF, which is too heavy and unstable for GitHub-hosted Windows/macOS CI in this shared-server workflow.
Skipping only this heavy path on hosted Windows/macOS is a CI-stability tradeoff, not a broad coverage removal: