Handle missed deadline alerts when DagRuns fail #68961

Open
shivaam wants to merge 1 commit into apache:main from shivaam:codex/airflow-60927-deadline-failure

Conversation

@shivaam shivaam commented Jun 25, 2026

Contributor

Summary

  • Handle pending DagRun deadlines immediately when DagRun.update_state() transitions a run to FAILED.
  • Keep Deadline.prune_deadlines() limited to successful DagRun cleanup.
  • Log and ignore immediate deadline-handling errors so deadline alert side
    effects do not block DagRun finalization.
  • Persist deadline callback context/routing data by reassigning callback.data, so both triggerer and executor callback paths survive commit/reload.
  • Serialize deadline.id as a string in callback context so async trigger kwargs can be stored and run by the triggerer.

Closes: #60927

Details

Failed DagRuns currently leave pending deadlines untouched until the scheduler's expired-deadline loop reaches deadline_time. This can delay the configured deadline alert by hours after the run has already failed.

This change treats scheduler-driven DagRun failure as an immediate missed deadline for pending deadlines on that DagRun. It uses the same row-locking shape as the scheduler deadline loop and filters already-missed rows to avoid duplicate handling.

Because deadline alerting is a side effect of the DagRun state transition, this also catches and logs immediate deadline-handling failures instead of letting them escape DagRun.update_state() and block the run from being marked failed.
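The error-capture behavior described above can be illustrated with a minimal, self-contained sketch. The class and method names here (`FakeDagRun`, `update_state_to_failed`, and the no-argument handler methods) are hypothetical stand-ins, not Airflow's actual code. The sketch only shows that an exception raised by the side effect is logged and the state transition still completes.

```python
import logging

log = logging.getLogger(__name__)


class FakeDagRun:
    """Hypothetical stand-in for a DagRun; not Airflow's real model."""

    def __init__(self, handler_should_fail: bool = False) -> None:
        self.state = "running"
        self.handler_should_fail = handler_should_fail

    def _handle_missed_deadlines(self) -> None:
        # Placeholder for the deadline side effect (e.g. queueing alert callbacks).
        if self.handler_should_fail:
            raise RuntimeError("deadline handling failed")

    def _handle_missed_deadlines_with_error_capture(self) -> None:
        try:
            self._handle_missed_deadlines()
        except Exception:
            # Log with traceback and continue: the alert is a side effect,
            # so its failure must not block the state transition.
            log.exception("Failed to handle missed deadlines for failed run")

    def update_state_to_failed(self) -> str:
        self.state = "failed"
        self._handle_missed_deadlines_with_error_capture()
        return self.state


run = FakeDagRun(handler_should_fail=True)
final_state = run.update_state_to_failed()
```

In this sketch the run ends in the failed state whether or not the handler raises, which is the property the tests named `test_dagrun_failure_ignores_missed_deadline_handling_error` are described as covering.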

While validating the callback path, the executor callback path also needed top-level routing fields persisted in callback.data; otherwise the scheduler could mark the callback pending but not enqueue it. The triggerer callback path needed the same JSON reassignment discipline and a string deadline id because trigger kwargs are serialized before the triggerer runs them.
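The need for a string deadline id can be sketched with the standard library alone. This assumes the id is a `uuid.UUID` and uses the `json` module as a stand-in for whatever serializer the trigger kwargs actually pass through; both are assumptions made for illustration.

```python
import json
import uuid

deadline_id = uuid.uuid4()

# A raw UUID object is not JSON-serializable, so kwargs that contain it
# cannot be stored in a JSON payload as-is.
try:
    json.dumps({"deadline_id": deadline_id})
    raw_uuid_serializable = True
except TypeError:
    raw_uuid_serializable = False

# Converting to str first serializes cleanly and can be parsed back later.
stored = json.dumps({"deadline_id": str(deadline_id)})
restored = uuid.UUID(json.loads(stored)["deadline_id"])
```

The reassignment half of the fix follows a common SQLAlchemy pattern: a plain JSON column does not track in-place mutation of the dict it holds, so assigning a new dict (for example `callback.data = {**callback.data, "key": value}`) is what marks the attribute as changed. That `callback.data` is such a column is an assumption here, inferred from the description above.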

Manual/API/UI "mark failed" remains out of scope for this patch.

Tests

uv run --project airflow-core pytest -q \
  airflow-core/tests/unit/models/test_deadline.py::TestDeadline::test_handle_miss \
  airflow-core/tests/unit/models/test_deadline.py::TestDeadline::test_handle_miss_persists_triggerer_callback_context \
  airflow-core/tests/unit/models/test_deadline.py::TestDeadline::test_handle_miss_persists_executor_callback_routing_data \
  airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_success_prunes_dagrun_deadlines \
  airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_success_skips_pruning_non_dagrun_deadlines \
  airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_success_handles_empty_deadline_list \
  airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_failure_handles_pending_deadline \
  airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_failure_ignores_missed_deadline_handling_error \
  airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_deadline_variable_interval_stable \
  --with-db-init

Result: 9 passed, 1 warning.

uv run ruff check airflow-core/src/airflow/models/dagrun.py airflow-core/src/airflow/models/deadline.py airflow-core/tests/unit/models/test_dagrun.py airflow-core/tests/unit/models/test_deadline.py

Result: All checks passed!

uv run ruff format --check airflow-core/src/airflow/models/dagrun.py airflow-core/src/airflow/models/deadline.py airflow-core/tests/unit/models/test_dagrun.py airflow-core/tests/unit/models/test_deadline.py

Result: 4 files already formatted.

Also tested manually in a real local Airflow daemon environment:

  • API server, dag processor, triggerer, and two schedulers.
  • Local Postgres metadata DB and LocalExecutor.
  • Sync/executor deadline callback reached success and wrote a marker row.
  • Async/triggerer deadline callback reached success and wrote a marker row.

Final E2E rows:

async: dagrun_state=failed, deadline_missed=True, callback_type=triggerer, callback_state=success
sync:  dagrun_state=failed, deadline_missed=True, callback_type=executor, callback_state=success

@shivaam shivaam requested review from XD-DENG and ashb as code owners June 25, 2026 03:17
@boring-cyborg boring-cyborg Bot added the area:deadline-alerts AIP-86 (former AIP-57) label Jun 25, 2026
@shivaam shivaam force-pushed the codex/airflow-60927-deadline-failure branch from 5299bc4 to bf27fda Compare June 25, 2026 03:38
def _handle_missed_deadlines_with_error_capture(self, *, session: Session) -> None:
    try:
        self._handle_missed_deadlines(session=session)
    except Exception:

Contributor Author

Missed-deadline handling is a side effect, so it should not affect the Dag's state change.

@shivaam shivaam force-pushed the codex/airflow-60927-deadline-failure branch from bf27fda to 48d9021 Compare June 25, 2026 03:54
potiuk commented Jun 25, 2026

Member

@shivaam — the Static checks CI job is failing here, which needs a code fix on your side (a rerun won't clear it). You can reproduce and fix locally with:

pre-commit run --all-files     # or: breeze static-checks

Once those pass and CI is green, it'll be ready for a maintainer to pick up. Thanks!

See the PR quality criteria.

Automated first-pass triage note drafted by an AI-assisted tool — may get things wrong; once addressed, a real Apache Airflow maintainer takes the next look. (why automated)


Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting

@shivaam shivaam force-pushed the codex/airflow-60927-deadline-failure branch from 48d9021 to 3eb066b Compare June 25, 2026 17:09
@eladkal eladkal requested a review from ferruzzi June 26, 2026 03:48
@ferruzzi

Contributor

Thanks for the PR and the thorough testing. This had a pretty long discussion in the Issue and I still don't think short-circuiting by default is the right answer.

Let's say you have a Dag which generates a report which you need for your morning meeting so you set a Deadline Alert for 9AM which emails you saying your report won't be ready.

Case 1) You work at Big Company. The Dag fails at 7AM, an admin sees this, determines it's just a network hiccup and clears the failing tasks which then pass at 7:30. If the deadline fired on failure, you would have your report and a notice saying you won't.

Case 2) Your Dag may have an on_failure_callback configured which notifies you of the failure. If the deadline also fires immediately, you now get two notifications about the same event with different messaging: Schrodinger's Dag is both failed and late.

Case 3) You work at Small Company and don't have an admin team, you do it all yourself, so a Dag failing likely won't be cleared and the missed deadline is inevitable. I'd argue that on_failure_callback currently handles this, but I can see the advantage of bundling it into the Deadline Alert directly.

Case 4) You realize something is wrong with the input and you manually mark the Dag as FAILED so the report doesn't get generated with bad data. Are you going to fix the data and re-trigger the Dag, or is that the end of the story? We don't know and shouldn't guess. Worth noting: this PR wouldn't cover this case either, which would introduce a difference in behavior between an automatic failure and a manual one.

Overall, always short-circuiting is not the answer. I would propose three alternatives which are not mutually exclusive:

  1. Add a flag to the Deadline Alert class. Naming is hard, but something like fire_on_failure: bool = False and let the user decide depending on their use case.
  2. Add a hook in the Deadline class which accepts a run_id and prunes all future deadlines for that run. This would allow on_failure_callback (or any other callback) to take action and then call the new hook to remove the now-redundant deadline(s) so you don't get double-notified.
  3. Anywhere in the UI where a user can mark a Dag as FAILED should have an option to also clear related Deadlines.
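Options 1 and 2 above can be sketched together with a small in-memory model. Everything here is hypothetical: `AlertSketch`, `DeadlineStore`, `on_run_failed`, and `prune_for_run` are illustrative names, not existing Airflow classes or methods, and `fire_on_failure` is only the tentative flag name floated in option 1.

```python
from dataclasses import dataclass, field


@dataclass
class AlertSketch:
    """Hypothetical stand-in for a Deadline Alert definition."""

    name: str
    fire_on_failure: bool = False  # proposed opt-in flag; name not final


@dataclass
class DeadlineStore:
    """Hypothetical in-memory store of pending deadlines keyed by run_id."""

    pending: dict[str, list[AlertSketch]] = field(default_factory=dict)

    def on_run_failed(self, run_id: str) -> list[str]:
        """Option 1: fire only opted-in alerts; the rest stay pending."""
        alerts = self.pending.get(run_id, [])
        fired = [a.name for a in alerts if a.fire_on_failure]
        self.pending[run_id] = [a for a in alerts if not a.fire_on_failure]
        return fired

    def prune_for_run(self, run_id: str) -> int:
        """Option 2: drop all pending deadlines for a run; return how many."""
        return len(self.pending.pop(run_id, []))


store = DeadlineStore()
store.pending["run_1"] = [
    AlertSketch("report_late"),                       # default: wait for 9AM
    AlertSketch("page_oncall", fire_on_failure=True)  # opted in
]
fired = store.on_run_failed("run_1")
pruned = store.prune_for_run("run_1")
```

With the default of `False`, the "report_late" alert keeps today's behavior and waits for its deadline time, while a callback such as on_failure_callback could call the prune hook to remove it once the failure has already been reported.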

To be clear, I do agree that there is a real user need here, I'm not trying to kill this or dissuade you. I just want to make sure we solve it without breaking the existing use cases. It's one very valid use case, but that doesn't invalidate the others.

That said, here's my suggestion: I'd also ask that the callback.data persistence fixes in deadline.py be split into a separate PR since those fix a real bug independent of this feature question. Whatever we do about the failure state, that looks like an actual legit improvement/bug-fix. Then I'd suggest reworking this PR to implement the flag and tracking the other two options as new Issues.

(Tagging @ramitkataria since he was also involved in the original discussion and may have thoughts.)

shivaam commented Jun 27, 2026

Contributor Author

@ferruzzi thanks, this makes sense. I split the independent callback.data persistence fix out into #69073 so it can be reviewed separately from the failure-behavior question. That PR is scoped to Deadline.handle_miss() persisting the callback context/routing data correctly.

For this PR, I agree that always firing the Deadline Alert on failure is too broad as the default. I will rework this toward an opt-in DeadlineAlert flag (name TBD, maybe something like fire_on_failure / alert_on_failure, defaulting to False) so users can choose this behavior for the cases where a failed run means the deadline is inevitably missed.

The prune-hook and UI/manual-failure options also make sense to me, but I think those can be tracked as follow-up issues unless you think they should be included in this PR.

Labels

area:deadline-alerts AIP-86 (former AIP-57)

Development

Successfully merging this pull request may close these issues.

Deadline Alert triggered after dag failure

3 participants