Add resume support to the AI flow orchestrator by luisorofino · Pull Request #24164 · DataDog/integrations-core

luisorofino · 2026-06-24T11:13:58Z

What does this PR do?

Adds resume support to the AI flow orchestrator (framework only — no CLI yet).

When the orchestrator is constructed with resume=True, it reads the run's checkpoints.yaml and continues a previously failed or interrupted run instead of starting over:

CheckpointManager.successful_phases() reports which phases reached success.
The orchestrator computes a dependency-closed completed set: a phase counts as done only if it and all of its transitive dependencies succeeded. It then skips registering those phases (their execute() is never called) and emits a completion PhaseTrigger for each so dependent phases unblock through the normal trigger chain. No resume-specific logic leaks into the phases.
The frontier (the phases that run first on resume — the ones that may be sitting on partial work) is carried via FlowContext.resume_frontier. An AgenticPhase in the frontier gets a short resumed-run notice appended to its system prompt, telling the agent this is a re-run, to reconcile any partially-written files, and including the previous error when one was recorded. Interrupted runs that never wrote a checkpoint (e.g. Ctrl+C) still get the notice, just without an error string.

The dependency-closure guard matters when the flow definition changes between a run and its resume (e.g. inserting a phase in the middle): a previously-succeeded phase that gains a new, un-run ancestor is correctly re-run instead of skipped with stale inputs.
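The closure and frontier described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Phase`, `completed_closure`, and `resume_frontier` are hypothetical names, and the frontier is assumed to be "every phase that is not done but whose dependencies are all done".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """Hypothetical stand-in for one entry of config.flow."""

    phase_id: str
    depends_on: tuple[str, ...] = ()


def completed_closure(flow: list[Phase], succeeded: set[str]) -> set[str]:
    """Phases that succeeded and whose transitive dependencies all succeeded.

    Assumes `flow` is topologically sorted (dependencies first), so every
    dependency is already classified when its dependent is visited.
    """
    done: set[str] = set()
    for phase in flow:
        if phase.phase_id in succeeded and all(dep in done for dep in phase.depends_on):
            done.add(phase.phase_id)
    return done


def resume_frontier(flow: list[Phase], done: set[str]) -> set[str]:
    """Phases that are not done but whose dependencies are all done (assumed definition)."""
    return {
        phase.phase_id
        for phase in flow
        if phase.phase_id not in done and all(dep in done for dep in phase.depends_on)
    }


# The flow changed between the run and its resume: "lint" was inserted
# between "fetch" and "parse", so "parse" gained an ancestor that never ran.
flow = [
    Phase("fetch"),
    Phase("lint", ("fetch",)),
    Phase("parse", ("lint",)),
    Phase("write", ("parse",)),
]
done = completed_closure(flow, succeeded={"fetch", "parse"})
print(sorted(done))                         # ['fetch']
print(sorted(resume_frontier(flow, done)))  # ['lint']
```

Although "parse" succeeded in the earlier run, it is excluded from the completed set because "lint" has not run, so it is re-executed after "lint" instead of being skipped with stale inputs.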

The dependency-closure computation walks config.flow in a single pass, which requires dependencies to be classified before the phases that depend on them. To guarantee that without assuming the author wrote flow.yaml in any particular order, FlowConfig now topologically sorts the flow at parse time (a stable sort applied after cycle detection, so phases with no ordering constraint keep their declaration order). Every consumer that iterates config.flow — registration, trigger emission, and the resume closure — can now rely on dependencies appearing before their dependents.
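One way to get a stable ordering of this kind is to repeatedly pick the earliest-declared phase whose dependencies have already been placed. The sketch below assumes a plain mapping of phase id to dependency ids; `stable_toposort` is a hypothetical helper, not the function added to FlowConfig.

```python
def stable_toposort(deps: dict[str, list[str]]) -> list[str]:
    """Order phase ids so dependencies precede dependents.

    `deps` maps each phase id to the ids it depends on, in declaration order
    (dicts preserve insertion order). Phases with no ordering constraint
    between them keep their declaration order.
    """
    ordered: list[str] = []
    placed: set[str] = set()
    remaining = list(deps)
    while remaining:
        # Pick the earliest-declared phase whose dependencies are all placed.
        for phase_id in remaining:
            if all(dep in placed for dep in deps[phase_id]):
                break
        else:
            # Cycle detection is assumed to have run already; this is a safety net.
            raise ValueError("cycle or unknown dependency in flow")
        ordered.append(phase_id)
        placed.add(phase_id)
        remaining.remove(phase_id)
    return ordered


# "report" is declared before the phase it depends on.
print(stable_toposort({"report": ["collect"], "collect": [], "notify": []}))
# ['collect', 'report', 'notify']
```

The quadratic scan is acceptable for flows with a handful of phases; a larger graph would call for Kahn's algorithm with a priority queue keyed on declaration index.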

Resume is phase-granular: a failed/interrupted phase re-runs from scratch; completed phases are not re-executed.

Next step: a follow-up PR will add the --resume flag (and the one-folder-per-integration run directory) to the ddev ai openmetrics CLI command, wiring resume= through to this orchestrator.

Tests

successful_phases() behavior, including malformed entries.
Orchestrator resume: completed phases are skipped; only the frontier is marked; completion triggers are emitted; the dependency closure re-runs descendants of a failed ancestor; an interrupted run with no checkpoints makes the root the frontier; without resume, checkpoints are ignored; a corrupt checkpoints.yaml surfaces as an actionable FlowConfigError; and the closure resolves correctly even when the flow is declared dependents-first.
FlowConfig topological sort: dependents declared before their dependencies are reordered, and the sort is stable for unconstrained phases.
AgenticPhase system-prompt injection: frontier phase with/without a recorded error; non-frontier phase gets no notice.

Unrelated drive-by fix

Also fixed a small bug in two existing test_registry tests that were asserting against a list while the value is a tuple (native_tool_names). This is leftover from a previous PR and is unrelated to resume — included only so the suite passes.

Motivation

A single transient failure (e.g. an API 500) or a Ctrl+C would abort an entire multi-phase AI run and force re-running everything from scratch, including the expensive completed phases. Resume lets a run pick up from the last successful checkpoint.

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required. — qa/skip-qa (developer-only ddev tooling, nothing shipped with the Agent)
If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

datadog-datadog-prod-us1 · 2026-06-24T11:16:05Z

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
• Patch Coverage: 99.44%
• Overall Coverage: 88.46%

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 6892765

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91aa95965a

AAraKKe

Thanks! Looking good, just small comments and questions.

AAraKKe · 2026-06-25T08:49:36Z

        )
+        if self._is_resume_frontier:
+            prior = (context.get("checkpoints") or {}).get(self._phase_id) or {}
+            error = prior.get("error") if isinstance(prior, dict) else None


question: when would prior here not be a dict?

It's a defensive guard. For example, if someone edits checkpoints.yaml manually in the middle of execution, the checkpoints might be corrupted, and the guard makes sure the phase still runs. Does that sound right?

Thank you!!
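For readers following this thread, the guard can be exercised in isolation with a sketch like the one below. The lookup mirrors the two lines in the diff above; the function name `resume_notice` and the notice wording are hypothetical and only stand in for whatever AgenticPhase actually appends to the system prompt.

```python
from typing import Any


def resume_notice(context: dict[str, Any], phase_id: str) -> str:
    """Build a resumed-run notice, tolerating missing or malformed checkpoints."""
    # Same lookup as in the diff: a missing or None entry falls back to {}.
    prior = (context.get("checkpoints") or {}).get(phase_id) or {}
    # A hand-edited file could leave a string or list here instead of a mapping.
    error = prior.get("error") if isinstance(prior, dict) else None

    lines = [
        "This is a re-run of a previously failed or interrupted phase.",
        "Reconcile any partially written files before continuing.",
    ]
    if error:
        lines.append(f"Previous error: {error}")
    return "\n".join(lines)


# Recorded failure: the error is included in the notice.
print(resume_notice({"checkpoints": {"generate": {"error": "API 500"}}}, "generate"))
# Malformed entry (a bare string): no exception, and no error line.
print(resume_notice({"checkpoints": {"generate": "oops"}}, "generate"))
```

Without the `isinstance` check, the malformed entry in the second call would raise `AttributeError` on `.get`.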

dd-octo-sts · 2026-06-25T09:53:07Z

Validation Report

All 21 validations passed.

Validation	Description	Status
`agent-reqs`	Verify check versions match the Agent requirements file	✅
`ci`	Validate CI configuration and code coverage settings	✅
`codeowners`	Validate every integration has a CODEOWNERS entry	✅
`config`	Validate default configuration files against spec.yaml	✅
`dep`	Verify dependency pins are consistent and Agent-compatible	✅
`http`	Validate integrations use the HTTP wrapper correctly	✅
`imports`	Validate check imports do not use deprecated modules	✅
`integration-style`	Validate check code style conventions	✅
`jmx-metrics`	Validate JMX metrics definition files and config	✅
`labeler`	Validate PR labeler config matches integration directories	✅
`legacy-signature`	Validate no integration uses the legacy Agent check signature	✅
`license-headers`	Validate Python files have proper license headers	✅
`licenses`	Validate third-party license attribution list	✅
`metadata`	Validate metadata.csv metric definitions	✅
`models`	Validate configuration data models match spec.yaml	✅
`openmetrics`	Validate OpenMetrics integrations disable the metric limit	✅
`package`	Validate Python package metadata and naming	✅
`qa-label`	Validate the pull request declares whether it needs QA for the next Agent release	✅
`readmes`	Validate README files have required sections	✅
`saved-views`	Validate saved view JSON file structure and fields	✅
`version`	Validate version consistency between package and changelog	✅

Add resume option to orchestrator

91aa959

luisorofino requested a review from a team as a code owner June 24, 2026 11:13

dd-octo-sts Bot added ddev team/agent-integrations labels Jun 24, 2026

luisorofino changed the title ~~Add resume option to orchestrator~~ Add resume support to the AI flow orchestrator Jun 24, 2026

chatgpt-codex-connector Bot reviewed Jun 24, 2026

Comment thread ddev/src/ddev/ai/runtime/orchestrator.py

luisorofino added the qa/skip-qa label Jun 24, 2026

luisorofino added 2 commits June 24, 2026 17:14

Fixes

3e4bd10

Sort topologically the flow config when parsing

30a8b3e

AAraKKe requested changes Jun 25, 2026

Remove slashes from injection prompt in agentic_phase

6892765

luisorofino mentioned this pull request Jun 25, 2026

Validate pipeline checkpoints with Pydantic models #24180

Open

3 tasks

luisorofino wants to merge 4 commits into loa/openmetrics-ai-gen from loa/orchestrator-resume