Skip to content

test/ci: follow up on hosted Ollama tool-calling timeouts#2399

Open
fl0rianr wants to merge 1 commit into
mainfrom
fl0rianr/CI_cli_endpoint_testtimeout
Open

test/ci: follow up on hosted Ollama tool-calling timeouts#2399
fl0rianr wants to merge 1 commit into
mainfrom
fl0rianr/CI_cli_endpoint_testtimeout

Conversation

@fl0rianr

Copy link
Copy Markdown
Collaborator

Summary

Follow-up to #2365.

The previous CI hardening added reset/diagnostic handling between shared-server test suites. New CI logs show that the reset itself is working, but a heavier Ollama native tool-calling path can still wedge hosted Windows/macOS runners: the request times out after 500s and can leave the Qwen tool-calling backend reported as "busy" / "in_use". That then causes misleading follow-on failures in later streaming-error tests.

This PR narrows that failure mode without removing the broader CI coverage.

Changes

  • Gate test_026_anthropic_messages_tool_calling on GitHub-hosted Windows/macOS.

    • Linux keeps the heavy native Qwen tool-calling coverage.
    • Windows/macOS can still opt in with LEMONADE_RUN_HEAVY_OLLAMA_TOOL_TESTS=1.
    • The lighter Ollama tool-streaming coverage remains active.
  • Make reset_server_state.py verify the unload postcondition via /api/ps.

    • /api/v1/unload returning is no longer treated as sufficient by itself.
    • The helper checks that no models remain running before reporting reset completion.
  • Make server_streaming_errors.py::test_003 skip when its unload precondition is not met.

    • This avoids a secondary timeout when a previous test has already left the server busy.
    • The primary failure remains attributed to the test that wedged the backend.

Rationale

The failing path is not a normal lightweight API compatibility check. It uses the native tool-calling model Qwen3-4B-Instruct-2507-GGUF, which is too heavy and unstable for GitHub-hosted Windows/macOS CI in this shared-server workflow.

Skipping only this heavy path on hosted Windows/macOS is a CI-stability tradeoff, not a broad coverage removal:

  • Linux still exercises the heavy native tool-calling behavior.
  • Windows/macOS still run the rest of the Ollama API suite.
  • The test can be explicitly re-enabled with an env var.
  • The reset helper now reports the real postcondition instead of giving false confidence.

@github-actions github-actions Bot added area::ci CI / GitHub Actions / self-hosted runner infrastructure enhancement New feature or request labels Jun 24, 2026
@fl0rianr fl0rianr force-pushed the fl0rianr/CI_cli_endpoint_testtimeout branch from 1c9a1ab to 474f888 Compare June 24, 2026 13:47
@kenvandine

Copy link
Copy Markdown
Member

@claude please review this

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Claude finished @kenvandine's task in 2m 8s —— View job


Review: PR #2399 — follow-up on hosted Ollama tool-calling timeouts

I read all three changed files plus the surrounding context (reset_server_state.py, server_streaming_errors.py, test_ollama.py, the /api/ps handler, and the CI workflow invocations). This is a well-scoped, well-reasoned CI-stability change. Below is my feedback.

✅ What's good

  • Narrow, reversible gating. _skip_slow_hosted_ci_tool_calling() only skips the one heavy native Qwen path on hosted Windows/macOS (GITHUB_ACTIONS=true + sys.platform in (darwin, win32)), keeps full coverage on Linux, and provides an explicit LEMONADE_RUN_HEAVY_OLLAMA_TOOL_TESTS=1 escape hatch. The lighter Ollama tool-streaming coverage stays active. This matches the stated tradeoff and doesn't silently erase coverage.
  • Real postcondition verification. Polling /api/ps until models == [] is the right fix — /api/v1/unload returning 200 isn't proof the backend actually drained. I confirmed handle_ps reflects router_->get_all_loaded_models(), so an empty list is a meaningful signal.
  • Safe blast radius for the strict reset. I checked every workflow call site — all 21 invocations of reset_server_state use --best-effort, and reset_server_state() is only ever invoked through main() (the server_cli2.py matches were comments/other cleanup). So the new RuntimeError on a wedged backend is caught, prints diagnostics, and returns 0. It adds visibility, not new hard CI failures. Good.
  • Skip-not-fail on unmet precondition in test_003 correctly keeps the blame on the test that actually wedged the backend rather than producing a misleading secondary timeout.

🟡 Minor nits (non-blocking)

  1. Per-request timeout exceeds the wait budget in the streaming helper. In server_streaming_errors.py, _running_models() uses timeout=TIMEOUT_DEFAULT (= 60s), but _wait_for_no_running_models(timeout=15) is meant to be a ~15s budget. A single hung GET /api/ps can therefore block up to 60s, blowing past the 15s deadline (the while time.time() < deadline check only runs between requests). The reset helper got this right (per-request timeout=5 under a 60s window). Suggest dropping the per-request timeout to ~5s here for consistency. Fix this →

  2. Duplicated helpers. _running_models / _wait_for_no_running_models now exist in both reset_server_state.py (raises) and server_streaming_errors.py (returns bool). The semantics differ (raise vs. skip), so I wouldn't force a merge, but the /api/ps polling logic could live as a shared util that each caller wraps. Optional.

  3. Cosmetic string reflow. Splitting "Run the calculator_calculate tool with expression set to 1+1" into a parenthesized concatenation is an unrelated reformat bundled into the diff. Harmless, just noise — fine to leave.

🔎 One thing worth confirming

/api/ps reports loaded models (LRU-resident), not strictly busy ones. After an unload, that's the same thing, so the logic holds for the reset/precondition path. Just flagging that if a future caller reuses _wait_for_no_running_models without a preceding unload, "no running models" really means "nothing loaded at all" — which may be stricter than intended. Not an issue for this PR's usage.

Verdict

Looks good to me. The approach is sound, the blast radius is contained by --best-effort, and the coverage tradeoff is explicit and documented. Only nit #1 is a genuine (minor) correctness inconsistency; the rest are optional polish.
· branch fl0rianr/CI_cli_endpoint_testtimeout

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area::ci CI / GitHub Actions / self-hosted runner infrastructure enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants