Add resume support to the AI flow orchestrator#24164
Conversation
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 6892765 | Docs | Datadog PR Page | Give us feedback! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 91aa95965a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
AAraKKe
left a comment
There was a problem hiding this comment.
Thanks! Looking good, just small comments and questions.
| ) | ||
| if self._is_resume_frontier: | ||
| prior = (context.get("checkpoints") or {}).get(self._phase_id) or {} | ||
| error = prior.get("error") if isinstance(prior, dict) else None |
There was a problem hiding this comment.
question: when is prior here not be a dict?
There was a problem hiding this comment.
It's a defensive guard, for example, if someone edits checkpoints.yaml manually in the middle of execution, then checkpoints might be corrupted. Just to make sure everything works. Does that sound right?
Thank you!!
Validation ReportAll 21 validations passed. Show details
|
What does this PR do?
Adds resume support to the AI flow orchestrator (framework only — no CLI yet).
When the orchestrator is constructed with
resume=True, it reads the run'scheckpoints.yamland continues a previously failed or interrupted run instead of starting over:CheckpointManager.successful_phases()reports which phases reachedsuccess.execute()is never called) and emits a completionPhaseTriggerfor each so dependent phases unblock through the normal trigger chain. No resume-specific logic leaks into the phases.FlowContext.resume_frontier. AnAgenticPhasein the frontier gets a short resumed-run notice appended to its system prompt, telling the agent this is a re-run, to reconcile any partially-written files, and including the previous error when one was recorded. Interrupted runs that never wrote a checkpoint (e.g.Ctrl+C) still get the notice, just without an error string.The dependency-closure guard matters when the flow definition changes between a run and its resume (e.g. inserting a phase in the middle): a previously-succeeded phase that gains a new, un-run ancestor is correctly re-run instead of skipped with stale inputs.
The dependency-closure computation walks
config.flowin a single pass, which requires dependencies to be classified before the phases that depend on them. To guarantee that without assuming the author wroteflow.yamlin any particular order,FlowConfignow topologically sorts the flow at parse time (a stable sort applied after cycle detection, so phases with no ordering constraint keep their declaration order). Every consumer that iteratesconfig.flow— registration, trigger emission, and the resume closure — can now rely on dependencies appearing before their dependents.Resume is phase-granular: a failed/interrupted phase re-runs from scratch; completed phases are not re-executed.
Next step: a follow-up PR will add the
--resumeflag (and the one-folder-per-integration run directory) to theddev ai openmetricsCLI command, wiringresume=through to this orchestrator.Tests
successful_phases()behavior, including malformed entries.resumecheckpoints are ignored; a corruptcheckpoints.yamlsurfaces as an actionableFlowConfigError; and the closure resolves correctly even when the flow is declared dependents-first.FlowConfigtopological sort: dependents declared before their dependencies are reordered, and the sort is stable for unconstrained phases.AgenticPhasesystem-prompt injection: frontier phase with/without a recorded error; non-frontier phase gets no notice.Unrelated drive-by fix
Also fixed a small bug in two existing
test_registrytests that were asserting against alistwhile the value is atuple(native_tool_names). This is leftover from a previous PR and is unrelated to resume — included only so the suite passes.Motivation
A single transient failure (e.g. an API 500) or a
Ctrl+Cwould abort an entire multi-phase AI run and force re-running everything from scratch, including the expensive completed phases. Resume lets a run pick up from the last successful checkpoint.Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required. —qa/skip-qa(developer-onlyddevtooling, nothing shipped with the Agent)backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged