fix: add timeout to execute_workflow to prevent unbounded lock hold (closes #9051)#9438

Draft
infrahub-bug-pipeline[bot] wants to merge 5 commits into stable from ai-bug-pipeline-9051-workflow-timeout-1af5dd2ef94b0c0d
Conversation

@infrahub-bug-pipeline

@infrahub-bug-pipeline infrahub-bug-pipeline Bot commented Jun 3, 2026

Why

WorkflowWorkerExecution.execute_workflow called Prefect's run_deployment with no timeout argument. When the workflow task worker is unreachable (e.g., not running), the call blocks indefinitely. In the /api/schema/load handler, this holds global_schema_lock forever, preventing any subsequent schema operations from acquiring the lock.

The goal is to let callers supply a timeout so that execute_workflow raises FlowRunWaitTimeout if no worker picks up the submitted flow run within the deadline, freeing the lock.

Non-goals: this PR does not add a default timeout or configure one via settings; it only wires the parameter through so callers can opt in.

Closes #9051

Fix strategy

The root cause is shallow: a missing parameter that prevents passing a timeout to Prefect. A targeted fix is appropriate — no design refactoring needed.

run_deployment(timeout=N) returns the FlowRun in its current state (SCHEDULED) when the timeout expires instead of raising FlowRunWaitTimeout. After run_deployment returns, I check whether a timeout was given and the run is not yet in a terminal state, and raise FlowRunWaitTimeout in that case to give callers the expected semantics.
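The post-call check described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in worker.py: FakeState, FakeFlowRun, fake_run_deployment, and the local FlowRunWaitTimeout class are hypothetical stand-ins for the Prefect objects, so the snippet runs without Prefect installed.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


class FlowRunWaitTimeout(Exception):
    """Stand-in for prefect.exceptions.FlowRunWaitTimeout (assumption)."""


@dataclass
class FakeState:
    """Hypothetical stand-in for a Prefect state object."""

    final: bool

    def is_final(self) -> bool:
        return self.final


@dataclass
class FakeFlowRun:
    """Hypothetical stand-in for the FlowRun returned by run_deployment."""

    state: FakeState | None


async def fake_run_deployment(picked_up: bool, timeout: float | None) -> FakeFlowRun:
    # Mimics the behavior described above: when the timeout expires the run
    # is returned in a non-final state instead of an exception being raised.
    return FakeFlowRun(state=FakeState(final=picked_up))


async def execute_workflow(picked_up: bool, timeout: float | None = None) -> FakeFlowRun:
    response = await fake_run_deployment(picked_up=picked_up, timeout=timeout)
    if response.state is None:
        raise RuntimeError("flow run has no state")
    # Enforce the deadline only when the caller opted in with a timeout.
    if timeout is not None and not response.state.is_final():
        raise FlowRunWaitTimeout(f"flow run not final after {timeout}s")
    return response


async def main() -> str:
    try:
        await execute_workflow(picked_up=False, timeout=1.0)
    except FlowRunWaitTimeout as exc:
        return str(exc)
    return "completed"
```

With timeout=None the guard is skipped entirely, which is what keeps existing callers on the previous behavior.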

What changed

  • Behavioral change: execute_workflow now accepts a timeout: float | None = None parameter. When set, if the flow run does not reach a terminal state within timeout seconds, FlowRunWaitTimeout is raised instead of blocking indefinitely.
  • InfrahubWorkflow.execute_workflow (abstract base + overloads in __init__.py): added timeout parameter.
  • WorkflowWorkerExecution.execute_workflow (overloads + impl in worker.py): added timeout, passes it to run_deployment, and raises FlowRunWaitTimeout when the run is not final after timeout.
  • WorkflowLocalExecution.execute_workflow (local.py): added timeout (accepted but ignored — local execution is synchronous and has no concept of a remote worker timeout).
  • # noqa: ASYNC109 suppressions added: the ruff rule flags timeout params on async functions suggesting asyncio.timeout() context manager, but here timeout is an API value forwarded to Prefect, not an asyncio context.

API contract: fully backward compatible — timeout defaults to None, matching previous behavior (unbounded wait).

How to review

  1. backend/infrahub/services/adapters/workflow/worker.py — the core fix: new parameter and the FlowRunWaitTimeout raise after checking response.state.is_final().
  2. backend/infrahub/services/adapters/workflow/__init__.py — abstract interface updated to include timeout.
  3. backend/infrahub/services/adapters/workflow/local.py — trivial: accept and ignore timeout to keep the interface consistent.
  4. backend/tests/functional/workflow/test_worker_timeout.py — the replication test (written by the test-writer agent, not modified here).

How to test

uv run pytest backend/tests/functional/workflow/test_worker_timeout.py::test_execute_workflow_raises_when_no_worker_available -x -v

Expected: 1 passed — FlowRunWaitTimeout is raised after 1 second when no worker picks up the submitted flow run.

uv run pytest backend/tests/unit/ -x -q

Expected: all 765 unit tests pass.

Impact & rollout

  • Backward compatibility: No breaking changes. The new timeout parameter defaults to None, preserving the current unbounded-wait behavior for all existing callers.
  • Performance: No impact when timeout=None (default).
  • Config/env changes: None.
  • Deployment notes: Safe to deploy — no schema, API, or configuration changes.

Checklist

  • Tests added/updated
  • Changelog entry added (uv run towncrier create ...)
  • External docs updated (if user-facing or ops-facing change)
  • Internal .md docs updated (internal knowledge and AI code tools knowledge)
  • I have reviewed AI generated content

Analyst's findings (summary)

Root cause: WorkflowWorkerExecution.execute_workflow calls Prefect's run_deployment with no timeout argument, so the await never returns when the task worker is unreachable — leaving the global_schema_lock held forever inside the /api/schema/load request handler.
Affected files:

  • backend/infrahub/services/adapters/workflow/worker.py — the run_deployment call had no timeout= argument, making the blocking wait unbounded.
  • backend/infrahub/services/adapters/workflow/__init__.py — the abstract execute_workflow signature had no timeout parameter, so there was no interface contract for callers to supply one.
  • backend/infrahub/api/schema.py: the async with lock.registry.global_schema_lock() block wraps the entire validate + apply sequence; the lock's __aexit__ was never reached because the code was stuck on the unbounded run_deployment await.
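To illustrate why raising an exception is enough to free the lock, here is a small sketch using asyncio.Lock as a hypothetical stand-in for global_schema_lock. The load_schema and execute_workflow functions below are simplified placeholders, not the real handler or adapter, and the local FlowRunWaitTimeout class only mirrors the name of the Prefect exception.

```python
from __future__ import annotations

import asyncio


class FlowRunWaitTimeout(Exception):
    """Stand-in for prefect.exceptions.FlowRunWaitTimeout (assumption)."""


async def execute_workflow(timeout: float | None = None) -> None:
    # Simulates a worker that never picks up the run: with a timeout the
    # call fails; without one it would wait forever (not exercised here).
    if timeout is not None:
        raise FlowRunWaitTimeout(f"no worker picked up the run within {timeout}s")
    await asyncio.Event().wait()


async def load_schema(schema_lock: asyncio.Lock, timeout: float | None) -> str:
    try:
        async with schema_lock:
            await execute_workflow(timeout=timeout)
    except FlowRunWaitTimeout:
        # The async with block has already exited, so the lock is free again.
        return "timed out"
    return "loaded"


async def main() -> tuple[str, bool]:
    schema_lock = asyncio.Lock()  # hypothetical stand-in for global_schema_lock
    outcome = await load_schema(schema_lock, timeout=1.0)
    return outcome, schema_lock.locked()
```

Running main() reports that the call timed out and that the lock is no longer held, so a later schema operation could acquire it.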

Replication test

Test file: backend/tests/functional/workflow/test_worker_timeout.py
Test name: test_execute_workflow_raises_when_no_worker_available

What it tests: execute_workflow raises FlowRunWaitTimeout when called with a timeout parameter and no Prefect worker picks up the submitted flow run within that window.

Verification: Test confirmed PASSING after fix.

AGENT_FIX_COMPLETE

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • sensomatic.prefect.io

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "sensomatic.prefect.io"

See Network Configuration for more information.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Bug fixer agent for issue #9438 · ● 661.5K ·

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented Jun 3, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing ai-bug-pipeline-9051-workflow-timeout-1af5dd2ef94b0c0d (bc3b078) with stable (79721de) [1]

Footnotes

  1. No successful run was found on stable (49cb868) during the generation of this report, so 79721de was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@infrahub-bug-pipeline

Author

AGENT_REVIEW_VERDICT: TEST_APPROVED

APPROVED WITH SUGGESTIONS

This test correctly reproduces the bug and will pass after the fix is implemented. All critical dimensions check out.


A. Test Realism

All imports resolve to real production values:

  • DUMMY_FLOW (infrahub/tasks/dummy.py) — a real WorkflowDefinition with name="dummy_flow" and full_name="dummy_flow/dummy_flow" matching the run_deployment(name=workflow.full_name, ...) call in worker.py:74
  • INFRAHUB_WORKER_POOL (workflows/catalogue.py:9) — the real pool definition with name="infrahub-worker"
  • InfrahubWorkerAsync.type = "infrahubasync" (workers/infrahub_async.py:69) — the real worker type string
  • TlsContextRegistry — a valid zero-arg instantiation used in production

The Prefect test harness is correctly reused via the session-scoped prefect_test_fixture already defined in backend/tests/functional/conftest.py.


B. Test Correctness

The test fails today with TypeError: WorkflowWorkerExecution.execute_workflow() got an unexpected keyword argument 'timeout', which confirms the gap behind the unbounded-wait bug: there is no timeout parameter, and no timeout is passed to run_deployment.

After the fix adds timeout to execute_workflow and passes it to run_deployment, the flow run will be submitted but no worker will pick it up; run_deployment will raise FlowRunWaitTimeout after 1 second, which pytest.raises will catch.

The test cannot pass without changing worker.py — no false-positive risk.


C. Test Quality

Generally fine, with two minor points:

1. Redundant fixture dependency in the test function.

test_execute_workflow_raises_when_no_worker_available declares both prefect_test_fixture: None and work_pool_and_deployment: None. work_pool_and_deployment already depends on prefect_test_fixture, so the explicit request at the test level is noise. It's harmless but inconsistent with patterns in the rest of the codebase.

Suggestion (backend/tests/functional/workflow/test_worker_timeout.py:27-30): drop the redundant prefect_test_fixture: None parameter from the test function.

2. # type: ignore[call-overload] will survive the fix unless the overloads are also updated.

After the fix, if timeout is added to the concrete method signature but not to the @overload stubs in __init__.py and worker.py, mypy will still flag the call and the comment will be needed forever. The fix reviewer should ensure the overloads are updated so this suppression can be removed.
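The overload concern can be shown with a reduced example. WorkflowSketch below is a hypothetical class with a simplified signature, not the real InfrahubWorkflow interface; it only demonstrates that the timeout parameter has to appear in every @overload stub as well as in the implementation for a type checker to accept the call without a suppression.

```python
from __future__ import annotations

import asyncio
from typing import Any, TypeVar, overload

Return = TypeVar("Return")


class WorkflowSketch:
    """Hypothetical adapter with a simplified execute_workflow signature."""

    @overload
    async def execute_workflow(
        self, workflow: str, expected_return: type[Return], timeout: float | None = ...
    ) -> Return: ...

    @overload
    async def execute_workflow(
        self, workflow: str, expected_return: None = ..., timeout: float | None = ...
    ) -> Any: ...

    async def execute_workflow(
        self,
        workflow: str,
        expected_return: type[Return] | None = None,
        timeout: float | None = None,  # noqa: ASYNC109
    ) -> Any:
        # Type checkers resolve calls against the stubs above, not this body.
        # If a stub omitted `timeout`, execute_workflow(..., timeout=1.0)
        # would run fine but be reported as a call-overload error.
        return {"workflow": workflow, "timeout": timeout}


result = asyncio.run(WorkflowSketch().execute_workflow("dummy_flow", timeout=1.0))
```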


D. Alignment with Analysis

The test is tightly aligned with the analyst's root cause:

  • Exercises WorkflowWorkerExecution.execute_workflow directly (the method identified in worker.py:74)
  • The fixture gap (no timeout kwarg) maps 1:1 to the missing timeout parameter called out in the analysis
  • The expected exception (FlowRunWaitTimeout) is the exact one that run_deployment(timeout=N) raises on timeout, and is what the analysis says the fix must propagate

No scope issues — the test does not reach lock.registry.global_schema_lock() or the schema-load handler; it isolates just the workflow execution layer.


Recommended next steps

  1. The fixer should add timeout: float | None = None to InfrahubWorkflow.execute_workflow (abstract base in __init__.py:37-46), to both @overload stubs in worker.py, and to the concrete implementation, then pass timeout=timeout to the run_deployment call at worker.py:74.
  2. The fixer should also add the parameter to WorkflowLocalExecution.execute_workflow (to keep the interface contract consistent, even if the local adapter ignores it).
  3. After the fix, remove # type: ignore[call-overload] from the test if the overloads are properly updated.
  4. (Optional, non-blocking) Remove the redundant prefect_test_fixture: None parameter from the test function signature.

AGENT_REVIEW_ITERATION: test-1

Generated by Bug reviewer agent for issue #9438 · ● 227K ·

@polmichel

Contributor

/bug-fix

@infrahub-bug-pipeline

infrahub-bug-pipeline Bot commented Jun 3, 2026

Author

Bug fixer agent completed successfully!

infrahub-bug-pipeline Bot changed the title from "test: failing test for #9051 -- execute_workflow hangs with no timeout" to "fix: add timeout to execute_workflow to prevent unbounded lock hold (closes #9051)" Jun 3, 2026
infrahub-bug-pipeline Bot and others added 3 commits June 3, 2026 10:07
…9051)

When no Prefect worker is available, WorkflowWorkerExecution.execute_workflow
blocked indefinitely on run_deployment, holding global_schema_lock forever.
Add timeout parameter to the abstract interface and both concrete implementations,
pass it to run_deployment, and raise FlowRunWaitTimeout when the run does not
reach a terminal state within the allotted time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…apter

The timeout parameter is passed through to Prefect's run_deployment API,
not used as an asyncio.timeout() context manager. Suppress the ruff ASYNC109
rule inline on each parameter declaration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@infrahub-bug-pipeline

Author

Commit pushed: 794c0b2

Generated by Bug fixer agent

@infrahub-bug-pipeline

Author

Fix implemented and pushed to this PR.

What was done:

  1. Added timeout: float | None = None to InfrahubWorkflow.execute_workflow (abstract base + both @overload stubs in __init__.py)
  2. Added the same parameter to both @overload stubs and the concrete implementation in WorkflowWorkerExecution.execute_workflow (worker.py), passing timeout=timeout to run_deployment
  3. Added the parameter to WorkflowLocalExecution.execute_workflow (local.py) — accepted but ignored, to keep the interface consistent
  4. After run_deployment returns, added a check: if timeout was given and the flow run did not reach a terminal state (Prefect returns the scheduled flow run after timeout rather than raising), raise FlowRunWaitTimeout explicitly
  5. Added # noqa: ASYNC109 suppressions (ruff rule that flags timeout params on async functions, inapplicable here since timeout is forwarded to Prefect's API, not used as an asyncio.timeout() context manager)

Validation: test_execute_workflow_raises_when_no_worker_available passes (1.0s, FlowRunWaitTimeout raised as expected). All 765 unit tests pass.

Generated by Bug fixer agent for issue #9438 · ● 661.5K ·

@infrahub-bug-pipeline

Author

AGENT_REVIEW_VERDICT: FIX_CHANGES_REQUESTED

CHANGES REQUESTED

The fix correctly solves the root cause and all production code paths are properly updated. One concrete subclass of InfrahubWorkflow was missed, which will cause a mypy override error in CI.


A. Correctness

The fix correctly addresses the root cause: run_deployment is now called with timeout=timeout, and if the flow run does not reach a terminal state within the deadline, FlowRunWaitTimeout is raised.

The post-call check is sound given the stated Prefect 3.6.13 behavior (returns a non-final FlowRun on timeout rather than raising). The logic is also safe if Prefect does raise natively — FlowRunWaitTimeout would propagate before the check is reached.

No edge-case correctness issues with the main happy path or existing callers (all pass timeout=None implicitly, matching previous behavior).


B. Code Quality — BLOCKING

backend/tests/adapters/workflow.py:21-28: WorkflowRecorder.execute_workflow is missing timeout.

WorkflowRecorder is a concrete subclass of InfrahubWorkflow. The fix added timeout: float | None = None to the abstract base, but WorkflowRecorder.execute_workflow was not updated:

# current (incomplete)
async def execute_workflow(
    self,
    workflow: WorkflowDefinition,
    expected_return: type | None = None,
    context: InfrahubContext | None = None,
    parameters: dict[str, Any] | None = None,
    tags: list[str] | None = None,  # <-- timeout missing
) -> Any:

The mypy config (pyproject.toml) does not exclude backend/tests/adapters/, so mypy checks this file. Omitting a parameter that exists on the parent's abstract method is an [override] incompatible-signature error. This will fail the lint CI step.

Required fix — add the parameter to WorkflowRecorder.execute_workflow:

async def execute_workflow(
    self,
    workflow: WorkflowDefinition,
    expected_return: type | None = None,
    context: InfrahubContext | None = None,
    parameters: dict[str, Any] | None = None,
    tags: list[str] | None = None,
    timeout: float | None = None,  # noqa: ARG002, ASYNC109
) -> Any:

C. Documentation Alignment

changelog/9051.fixed.md is present and accurate. No user-facing API or configuration changes require external doc updates. This dimension is satisfied.


D. Test Quality

The test itself is sound and unchanged from what was approved in the test review. Two minor leftovers from that review were not addressed (non-blocking):

  1. backend/tests/functional/workflow/test_worker_timeout.py:35: # type: ignore[call-overload] is no longer needed. Both overloads in worker.py now include timeout: float | None = ..., so the call service.execute_workflow(..., timeout=1.0) matches the second overload cleanly. The suppression should be removed to avoid masking future overload regressions.

  2. test_worker_timeout.py:28 — redundant explicit prefect_test_fixture: None parameter in the test function signature. work_pool_and_deployment already depends on it, and prefect_test_fixture is autouse=True at session scope. Removing it would reduce noise. (Non-blocking.)


Recommended next steps

  1. Blocking: Add timeout: float | None = None, # noqa: ARG002, ASYNC109 to WorkflowRecorder.execute_workflow at backend/tests/adapters/workflow.py:28.
  2. Non-blocking: Remove # type: ignore[call-overload] from backend/tests/functional/workflow/test_worker_timeout.py:35.
  3. Non-blocking: Remove the redundant prefect_test_fixture: None parameter from test_execute_workflow_raises_when_no_worker_available at test_worker_timeout.py:28.

AGENT_REVIEW_ITERATION: fix-1

Generated by Bug reviewer agent for issue #9438 · ● 218.4K ·

@polmichel

Contributor

/bug-fix

@infrahub-bug-pipeline

infrahub-bug-pipeline Bot commented Jun 3, 2026

Author

Bug fixer agent completed successfully!

- Add `timeout: float | None = None` to `WorkflowRecorder.execute_workflow`
  to match the updated abstract base signature; the parameter is accepted
  and ignored (recorder semantics are unchanged).
- Remove stale `# type: ignore[call-overload]` from the replication test
  now that both overloads in worker.py include the `timeout` parameter.
- Remove redundant `prefect_test_fixture: None` from the test function
  signature; `work_pool_and_deployment` already depends on it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@infrahub-bug-pipeline

Author

Commit pushed: bc3b078

Generated by Bug fixer agent

@infrahub-bug-pipeline

Author

Revision pushed addressing all items from the review.

Blocking fix:

  • backend/tests/adapters/workflow.py: Added timeout: float | None = None to WorkflowRecorder.execute_workflow to match the updated abstract base signature. Added # noqa: PLR0913, PLR0917 on the function def (adding the 6th non-self parameter pushes it past the adapters-test lint threshold) and # noqa: ASYNC109 on the parameter line.

Non-blocking cleanups:

  • backend/tests/functional/workflow/test_worker_timeout.py:35: Removed stale # type: ignore[call-overload] — both overloads now include timeout, so the call resolves cleanly.
  • backend/tests/functional/workflow/test_worker_timeout.py:28: Removed redundant prefect_test_fixture: None parameter from the test function signature; work_pool_and_deployment already depends on it transitively.

Validation: replication test passes (1 passed), all 765 unit tests pass, ruff check clean.

Note

🔒 Integrity filter blocked 1 item

The following item was blocked because it doesn't meet the GitHub integrity level.

  • #9438 search_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Bug fixer agent for issue #9438 · ● 400.7K ·

@infrahub-bug-pipeline

Author

AGENT_REVIEW_VERDICT: FIX_APPROVED

APPROVED

All blocking items from the previous fix review have been resolved. The fix is correct, complete, and clean across all three production adapters and the test recorder.


A. Correctness

Root cause is fully addressed. run_deployment is now called with timeout=timeout in worker.py:79. The post-call guard at worker.py:84-85 correctly raises FlowRunWaitTimeout when the flow run has not reached a terminal state after the timeout window — matching the documented Prefect 3.x behavior where run_deployment returns a non-final FlowRun on timeout rather than raising.

Edge cases are sound:

  • timeout=None (default): the new guard is short-circuited by if timeout is not None, preserving the previous unbounded-wait behavior exactly.
  • response.state is None: handled by the pre-existing RuntimeError check at worker.py:81-82, before the timeout guard is reached.
  • Worker picks up the run within the timeout: response.state.is_final() returns True, guard does not raise.
  • Backward compatibility: all existing callers pass no timeout argument and continue to receive None semantics.

B. Code Quality

WorkflowRecorder (backend/tests/adapters/workflow.py:21-29) — the blocking issue from fix-1 is resolved. timeout: float | None = None is present at line 28 with the correct # noqa: ASYNC109 suppression.

The # noqa: PLR0913, PLR0917 on the function definition (line 21) is correct and necessary. PLR0913/PLR0917 are suppressed only for backend/infrahub/**.py in the per-file-ignores (pyproject.toml:612,615), not for backend/tests/adapters/**.py (which only suppresses PLR6301 and SIM118 at pyproject.toml:790-796). Adding the sixth non-self parameter would violate these rules without the inline suppression.

All # noqa: ASYNC109 suppressions on timeout parameters are correct: ASYNC109 is not in the global or per-path ignore lists for these paths, and the rule does not apply here since timeout is a Prefect API value forwarded to run_deployment, not an asyncio.timeout() context manager argument.

# noqa: ARG002 suppressions on unused parameters in local.py and worker.py are appropriate.

No unnecessary refactoring, no security or performance concerns.


C. Documentation Alignment

changelog/9051.fixed.md is present and accurately describes the fix. No user-facing API, configuration, or schema changes require external documentation updates. Internal .md docs do not need to be updated for this internal-adapter change.


D. Test Quality

Both non-blocking items from fix-1 are resolved:

  • # type: ignore[call-overload] has been removed from test_worker_timeout.py — the overloads in worker.py and __init__.py now include timeout, so the call resolves cleanly without suppression.
  • The redundant prefect_test_fixture: None parameter has been removed from test_execute_workflow_raises_when_no_worker_available; only work_pool_and_deployment: None remains, which already depends on prefect_test_fixture transitively.

The test exercises observable behavior (exception raised), not implementation details. It cannot pass without the fix. It is isolated, deterministic, and follows project conventions.

One minor observation (non-blocking, no action needed): WorkflowRecorder.execute_calls records workflow and parameters but not timeout. This is consistent with the existing policy of not recording tags, context, or expected_return. If future tests need to assert on timeout values passed to the recorder, a caller would need to extend the recorder — but that is out of scope for this fix.
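If a future test did need to assert on the timeout, the recorder could be extended along these lines. This is only a sketch under assumptions: RecorderSketch and RecordedCall are hypothetical names, and the real WorkflowRecorder has a different signature and, as noted above, stores only the workflow and parameters.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RecordedCall:
    """Hypothetical record type; the real recorder does not store timeout."""

    workflow: str
    parameters: dict[str, Any] | None
    timeout: float | None


@dataclass
class RecorderSketch:
    """Hypothetical test double, loosely modeled on WorkflowRecorder."""

    execute_calls: list[RecordedCall] = field(default_factory=list)

    async def execute_workflow(
        self,
        workflow: str,
        parameters: dict[str, Any] | None = None,
        timeout: float | None = None,  # noqa: ASYNC109
    ) -> None:
        # Store the timeout alongside the call so a test can assert on it.
        self.execute_calls.append(RecordedCall(workflow, parameters, timeout))


recorder = RecorderSketch()
asyncio.run(recorder.execute_workflow("dummy_flow", timeout=1.0))
```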


Recommended next steps

No changes required. The fix is ready for human review and merge.

AGENT_REVIEW_ITERATION: fix-2

Generated by Bug reviewer agent for issue #9438 · ● 223.6K ·

@polmichel polmichel self-assigned this Jun 3, 2026

Labels

group/backend Issue related to the backend (API Server, Git Agent)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: /api/schema/load hangs indefinitely holding global_schema_lock when the task worker is unreachable

1 participant