Add resume support to the AI flow orchestrator#24164

Open
luisorofino wants to merge 4 commits into loa/openmetrics-ai-gen from loa/orchestrator-resume

@luisorofino luisorofino commented Jun 24, 2026

Contributor

What does this PR do?

Adds resume support to the AI flow orchestrator (framework only — no CLI yet).

When the orchestrator is constructed with resume=True, it reads the run's checkpoints.yaml and continues a previously failed or interrupted run instead of starting over:

  • CheckpointManager.successful_phases() reports which phases reached success.
  • The orchestrator computes a dependency-closed completed set: a phase counts as done only if it and all of its transitive dependencies succeeded. It then skips registering those phases (their execute() is never called) and emits a completion PhaseTrigger for each so dependent phases unblock through the normal trigger chain. No resume-specific logic leaks into the phases.
  • The frontier (the phases that run first on resume — the ones that may be sitting on partial work) is carried via FlowContext.resume_frontier. An AgenticPhase in the frontier gets a short resumed-run notice appended to its system prompt, telling the agent this is a re-run, to reconcile any partially-written files, and including the previous error when one was recorded. Interrupted runs that never wrote a checkpoint (e.g. Ctrl+C) still get the notice, just without an error string.
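The three bullets above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the checkpoint entry shape (`{"status": "success"}`), the function names `compute_frontier` and `resume_notice`, the notice wording, and the phase names are all assumptions made for the example.

```python
from __future__ import annotations


def successful_phases(checkpoints: dict[str, object]) -> set[str]:
    """Return the ids of phases whose checkpoint entry records success.

    Assumption: each entry looks like {"status": "success", ...}.
    Malformed entries (anything that is not a dict) are ignored.
    """
    return {
        phase_id
        for phase_id, entry in checkpoints.items()
        if isinstance(entry, dict) and entry.get("status") == "success"
    }


def compute_frontier(flow: dict[str, list[str]], completed: set[str]) -> set[str]:
    """Phases that run first on resume: not completed, but every dependency is."""
    return {
        phase_id
        for phase_id, deps in flow.items()
        if phase_id not in completed and all(dep in completed for dep in deps)
    }


def resume_notice(previous_error: str | None) -> str:
    """Build the text appended to a frontier phase's system prompt (hypothetical wording)."""
    lines = [
        "NOTE: this is a resumed run of a phase that did not finish.",
        "Reconcile any partially-written files before continuing.",
    ]
    if previous_error:
        lines.append(f"Previous error: {previous_error}")
    return "\n".join(lines)


# Hypothetical three-phase flow: phase id -> list of dependencies.
flow = {"discover": [], "generate": ["discover"], "validate": ["generate"]}
checkpoints = {
    "discover": {"status": "success"},
    "generate": {"status": "failed", "error": "API 500"},
}

# "discover" has no dependencies, so the succeeded set is already dependency-closed here.
completed = successful_phases(checkpoints)
frontier = compute_frontier(flow, completed)
```

With no checkpoints at all (an interrupted run), `compute_frontier(flow, set())` yields only the root phase, and `resume_notice(None)` produces the notice without an error line.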

The dependency-closure guard matters when the flow definition changes between a run and its resume (e.g. inserting a phase in the middle): a previously-succeeded phase that gains a new, un-run ancestor is correctly re-run instead of skipped with stale inputs.
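A minimal sketch of that guard, assuming the flow is a mapping from phase id to its dependencies and is already in dependency order. The function name `dependency_closed` and the phase names are hypothetical; the actual orchestrator code may be structured differently.

```python
from __future__ import annotations


def dependency_closed(flow: dict[str, list[str]], succeeded: set[str]) -> set[str]:
    """Single pass over a dependency-ordered flow.

    A phase counts as completed only if it succeeded AND all of its
    dependencies were already classified as completed.
    """
    completed: set[str] = set()
    for phase_id, deps in flow.items():
        if phase_id in succeeded and all(dep in completed for dep in deps):
            completed.add(phase_id)
    return completed


# Previous run: "discover" and "generate" succeeded, "validate" failed.
succeeded = {"discover", "generate"}

old_flow = {"discover": [], "generate": ["discover"], "validate": ["generate"]}

# The flow was edited before resuming: "normalize" now sits between the two.
new_flow = {
    "discover": [],
    "normalize": ["discover"],
    "generate": ["normalize"],
    "validate": ["generate"],
}

unchanged = dependency_closed(old_flow, succeeded)  # "generate" is skipped on resume
edited = dependency_closed(new_flow, succeeded)     # "generate" is re-run on resume
```

In the edited flow, `generate` succeeded previously but its new ancestor `normalize` never ran, so it drops out of the completed set and is executed again with fresh inputs.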

The dependency-closure computation walks config.flow in a single pass, which requires dependencies to be classified before the phases that depend on them. To guarantee that without assuming the author wrote flow.yaml in any particular order, FlowConfig now topologically sorts the flow at parse time (a stable sort applied after cycle detection, so phases with no ordering constraint keep their declaration order). Every consumer that iterates config.flow — registration, trigger emission, and the resume closure — can now rely on dependencies appearing before their dependents.
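One way to get a stable topological order is to repeatedly emit the earliest-declared phase whose dependencies have all been emitted. The sketch below assumes that interpretation of "stable"; `stable_toposort` is a hypothetical name, and the real `FlowConfig` may use a different algorithm. Cycle detection is assumed to have run beforehand, so the `ValueError` is only a safety net.

```python
from __future__ import annotations


def stable_toposort(flow: dict[str, list[str]]) -> list[str]:
    """Order phases so dependencies come before their dependents.

    Among phases that are ready at the same time, the one declared
    first in `flow` is emitted first, so unconstrained phases keep
    their declaration order.
    """
    remaining = list(flow)  # declaration order
    placed: set[str] = set()
    ordered: list[str] = []
    while remaining:
        for phase_id in remaining:
            if all(dep in placed for dep in flow[phase_id]):
                ordered.append(phase_id)
                placed.add(phase_id)
                remaining.remove(phase_id)
                break  # restart the scan from the earliest declared phase
        else:
            # No phase was ready: a cycle or a dependency on an unknown phase.
            raise ValueError(f"cannot order phases: {remaining}")
    return ordered


# Declared dependents-first: "report" appears before the "build" it depends on.
declared = {"report": ["build"], "lint": [], "build": []}
ordered = stable_toposort(declared)
```

Here `lint` and `build` have no ordering constraint between them, so they stay in declaration order, while `report` moves after `build`. The repeated scan is quadratic in the number of phases, which is acceptable for flows with a handful of phases.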

Resume is phase-granular: a failed/interrupted phase re-runs from scratch; completed phases are not re-executed.

Next step: a follow-up PR will add the --resume flag (and the one-folder-per-integration run directory) to the ddev ai openmetrics CLI command, wiring resume= through to this orchestrator.

Tests

  • successful_phases() behavior, including malformed entries.
  • Orchestrator resume: completed phases are skipped; only the frontier is marked; completion triggers are emitted; the dependency closure re-runs descendants of a failed ancestor; an interrupted run with no checkpoints makes the root the frontier; without resume, checkpoints are ignored; a corrupt checkpoints.yaml surfaces as an actionable FlowConfigError; and the closure resolves correctly even when the flow is declared dependents-first.
  • FlowConfig topological sort: dependents declared before their dependencies are reordered, and the sort is stable for unconstrained phases.
  • AgenticPhase system-prompt injection: frontier phase with/without a recorded error; non-frontier phase gets no notice.

Unrelated drive-by fix

Also fixed a small bug in two existing test_registry tests that were asserting against a list while the value is a tuple (native_tool_names). This is leftover from a previous PR and is unrelated to resume — included only so the suite passes.
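For context on why those assertions failed: in Python a tuple never compares equal to a list, even when the elements match. The tool names below are made up for the illustration and are not the real contents of native_tool_names.

```python
# Hypothetical value; the real tuple lives in the registry under test.
native_tool_names = ("read_file", "grep")

# A tuple and a list with the same elements are not equal.
matches_list = native_tool_names == ["read_file", "grep"]

# Comparing against a tuple (or converting first) gives the intended result.
matches_tuple = native_tool_names == ("read_file", "grep")
matches_converted = list(native_tool_names) == ["read_file", "grep"]
```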

Motivation

A single transient failure (e.g. an API 500) or a Ctrl+C would abort an entire multi-phase AI run and force re-running everything from scratch, including the expensive completed phases. Resume lets a run pick up from the last successful checkpoint.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required. — qa/skip-qa (developer-only ddev tooling, nothing shipped with the Agent)
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

datadog-datadog-prod-us1 Bot commented Jun 24, 2026

Contributor

Tests  Code Coverage

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 99.44%
Overall Coverage: 88.46%

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 6892765

@luisorofino luisorofino changed the title from "Add resume option to orchestrator" to "Add resume support to the AI flow orchestrator" Jun 24, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91aa95965a

Comment thread ddev/src/ddev/ai/runtime/orchestrator.py
@luisorofino luisorofino added the qa/skip-qa label (Automatically skip this PR for the next QA) Jun 24, 2026

@AAraKKe AAraKKe left a comment

Collaborator

Thanks! Looking good, just small comments and questions.

Comment thread ddev/src/ddev/ai/phases/agentic_phase.py Outdated
)
if self._is_resume_frontier:
    prior = (context.get("checkpoints") or {}).get(self._phase_id) or {}
    error = prior.get("error") if isinstance(prior, dict) else None

Collaborator

question: when would prior here not be a dict?

Contributor Author

It's a defensive guard. For example, if someone edits checkpoints.yaml manually in the middle of execution, the checkpoints might be corrupted. The guard just makes sure everything keeps working in that case. Does that sound right?

Thank you!!
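To make the guard concrete, here is a standalone sketch of the same lookup pulled out into a helper. The helper name `previous_error` and the sample checkpoint contents are assumptions for illustration; only the two-line lookup mirrors the snippet under review.

```python
from __future__ import annotations


def previous_error(checkpoints: dict | None, phase_id: str) -> str | None:
    """Return the recorded error for a phase, tolerating malformed entries."""
    prior = (checkpoints or {}).get(phase_id) or {}
    # A hand-edited file could leave a string or list here instead of a mapping.
    return prior.get("error") if isinstance(prior, dict) else None


well_formed = previous_error({"generate": {"status": "failed", "error": "API 500"}}, "generate")
hand_edited = previous_error({"generate": "failed"}, "generate")  # entry is a bare string
missing = previous_error(None, "generate")                        # no checkpoints loaded
```

Without the `isinstance` check, the hand-edited case would raise `AttributeError` because a string has no `get` method.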

Comment thread ddev/src/ddev/ai/runtime/checkpoints.py

dd-octo-sts Bot commented Jun 25, 2026

Contributor

Validation Report

All 21 validations passed.

| Validation | Description | Status |
|---|---|---|
| agent-reqs | Verify check versions match the Agent requirements file | passed |
| ci | Validate CI configuration and code coverage settings | passed |
| codeowners | Validate every integration has a CODEOWNERS entry | passed |
| config | Validate default configuration files against spec.yaml | passed |
| dep | Verify dependency pins are consistent and Agent-compatible | passed |
| http | Validate integrations use the HTTP wrapper correctly | passed |
| imports | Validate check imports do not use deprecated modules | passed |
| integration-style | Validate check code style conventions | passed |
| jmx-metrics | Validate JMX metrics definition files and config | passed |
| labeler | Validate PR labeler config matches integration directories | passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | passed |
| license-headers | Validate Python files have proper license headers | passed |
| licenses | Validate third-party license attribution list | passed |
| metadata | Validate metadata.csv metric definitions | passed |
| models | Validate configuration data models match spec.yaml | passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | passed |
| package | Validate Python package metadata and naming | passed |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release | passed |
| readmes | Validate README files have required sections | passed |
| saved-views | Validate saved view JSON file structure and fields | passed |
| version | Validate version consistency between package and changelog | passed |


Labels

ddev, qa/skip-qa (Automatically skip this PR for the next QA), team/agent-integrations

2 participants