Validate pipeline checkpoints with Pydantic models #24180

Open
luisorofino wants to merge 2 commits into loa/orchestrator-resume from loa/checkpoints-pydantic-validation
Conversation

luisorofino (Contributor) commented Jun 25, 2026

What does this PR do?

Replaces the untyped dict[str, Any] checkpoint storage with Pydantic models (SuccessCheckpoint, FailedCheckpoint, CheckpointTokenInfo), validated on read via a discriminated union on the status field.

Key changes:

  • CheckpointStatus is now a StrEnum instead of a plain string, eliminating isinstance(data, dict) guards scattered across the codebase
  • CheckpointManager.read() returns dict[str, PhaseCheckpoint] and raises CheckpointReadError on any invalid entry (previously silently ignored)
  • CheckpointManager.write_phase_checkpoint() now accepts a PhaseCheckpoint model instead of a raw dict
  • Call sites in base.py and agentic_phase.py construct typed models directly

Motivated by a review comment in #24164.

Note for reviewers: Several test files (test_base.py, test_agentic_phase.py, test_inspect_endpoint.py, test_orchestrator.py) have mechanical changes to use attribute access (checkpoint.status, checkpoint.tokens) instead of dict subscripting (checkpoint["status"]). Their assertions and coverage are unchanged.

Motivation

The untyped checkpoint dict made it impossible to know the shape of a checkpoint without reading all write sites. The isinstance(data, dict) guard in successful_phases() was a symptom of this. Pydantic models make the schema explicit, catch corruption early, and let callers use attribute access instead of string keys.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

luisorofino added the qa/skip-qa label Jun 25, 2026
luisorofino requested a review from a team as a code owner June 25, 2026 11:02
dd-octo-sts Bot added the ddev label Jun 25, 2026
luisorofino marked this pull request as draft June 25, 2026 11:03
luisorofino changed the title from "Add checkpoint validation from Pydantic" to "Validate pipeline checkpoints with Pydantic models" Jun 25, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1df4318d92

Comment on lines +86 to +88
        result[phase_id] = _CHECKPOINT_ADAPTER.validate_python(data)
    except ValidationError as e:
        raise CheckpointReadError(f"Checkpoint for phase {phase_id!r} in {self._path} is invalid: {e}") from e

[P2] Keep fresh runs from validating stale checkpoints

Because Phase.process_message unconditionally calls self._checkpoint_manager.read() when building the phase context, this validation now runs even when the orchestrator is started without resume. A valid-YAML but schema-invalid checkpoint left by an older/interrupted/manual run, such as a status: success entry missing the new tokens or memory_path fields, will abort a fresh run before it can overwrite the file, despite non-resume runs registering every phase from scratch. Consider restricting strict validation to resume paths or otherwise tolerating stale entries for fresh runs.
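
One possible shape for that suggestion is sketched below, purely as an illustration. The `load_checkpoints` function and its `strict` flag are hypothetical and not part of this PR, and the models are trimmed to the minimum needed to show the control flow (plain string literals replace the `StrEnum` for brevity). It assumes Pydantic v2.

```python
import logging
from typing import Annotated, Any, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError

logger = logging.getLogger(__name__)


class SuccessCheckpoint(BaseModel):
    status: Literal["success"]
    memory_path: str


class FailedCheckpoint(BaseModel):
    status: Literal["failed"]


AnyCheckpoint = Union[SuccessCheckpoint, FailedCheckpoint]
_CHECKPOINT_ADAPTER = TypeAdapter(Annotated[AnyCheckpoint, Field(discriminator="status")])


class CheckpointReadError(Exception):
    """Raised when a stored checkpoint entry fails validation."""


def load_checkpoints(raw: dict[str, Any], *, strict: bool) -> dict[str, AnyCheckpoint]:
    # Hypothetical helper: strict=True for resume runs, strict=False for fresh runs.
    result: dict[str, AnyCheckpoint] = {}
    for phase_id, data in raw.items():
        try:
            result[phase_id] = _CHECKPOINT_ADAPTER.validate_python(data)
        except ValidationError as e:
            if strict:
                raise CheckpointReadError(f"Checkpoint for phase {phase_id!r} is invalid: {e}") from e
            # A fresh run re-registers every phase, so a stale entry is only logged and dropped.
            logger.warning("Ignoring stale checkpoint for phase %r: %s", phase_id, e)
    return result
```

A resume path would call `load_checkpoints(raw, strict=True)`, while a fresh run would pass `strict=False` and overwrite the file as it progresses. Whether dropping stale entries silently is acceptable is a design decision for the PR author.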

@luisorofino luisorofino marked this pull request as ready for review June 25, 2026 11:16
dd-octo-sts Bot (Contributor) commented Jun 25, 2026

Validation Report

All 21 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and code coverage settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

datadog-datadog-prod-us1-2 Bot commented Jun 25, 2026

⚠️ Warnings

🚦 1 Pipeline job failed

PR All | test / j06ca546 / SNMP   View in Datadog   GitHub Actions

🧪 1 Test failed in 1 job

PR All | run   GitHub Actions

All test failures are known flaky — job may pass on retry.

❄️ Known flaky: test_e2e_snmp_listener from test_e2e_snmp_listener.py   View in Datadog
Needed at least 1 candidates for 'datadog.snmp.check_duration', got 0
Expected:
        MetricStub(name='datadog.snmp.check_duration', type=0, value=None, tags=['autodiscovery_subnet:172.18.0.0/28', 'device_vendor:apc', 'firmware_version:2.0.3-test', 'loader:python', 'model:APC Smart-UPS 600', 'serial_num:test_serial', 'snmp_device:172.18.0.1', 'snmp_profile:apc_ups', 'ups_name:testIdentName'], hostname=None, device=None, flush_first_value=None)
Difference to closest:
        Expected tag snmp_device:172.18.0.1
        Found snmp_device:172.18.0.2

Similar submitted:
Score   Most similar
1.00    MetricStub(name='datadog.snmp.check_duration', type=0, value=0.21761059761047363, tags=['autodiscovery_subnet:172.18.0.0/28', 'device_vendor:apc', 'firmware_version:2.0.3-test', 'loader:python', 'model:APC Smart-UPS 600', 'serial_num:test_serial', 'snmp_device:172.18.0.2', 'snmp_profile:apc_ups', 'ups_name:testIdentName'], hostname='runnervm08nci', device=None, flush_first_value=False)
...

Not introduced in this PR.

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 88.46% (+0.00%)

🔗 Commit SHA: b4e393a

Labels

ddev, qa/skip-qa, team/agent-integrations
