Handle missed deadline alerts when DagRuns fail#68961
Conversation
5299bc4 to
bf27fda
Compare
| def _handle_missed_deadlines_with_error_capture(self, *, session: Session) -> None: | ||
| try: | ||
| self._handle_missed_deadlines(session=session) | ||
| except Exception: |
There was a problem hiding this comment.
missed deadline is a side effect and it should not impact the state change of the dag.
bf27fda to
48d9021
Compare
|
@shivaam — the Static checks CI job is failing here, which needs a code fix on your side (a rerun won't clear it). You can reproduce and fix locally with: Once those pass and CI is green, it'll be ready for a maintainer to pick up. Thanks! See the PR quality criteria. Automated first-pass triage note drafted by an AI-assisted tool — may get things wrong; once addressed, a real Apache Airflow maintainer takes the next look. (why automated) Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting |
48d9021 to
3eb066b
Compare
|
Thanks for the PR and the thorough testing. This had a pretty long discussion in the Issue and I still don't think short-circuiting by default is the right answer. Let's say you have a Dag which generates a report which you need for your morning meeting so you set a Deadline Alert for 9AM which emails you saying your report won't be ready. Case 1) You work at Big Company. The Dag fails at 7AM, an admin sees this, determines it's just a network hiccup and clears the failing tasks which then pass at 7:30. If the deadline fired on failure, you would have your report and a notice saying you won't. Case 2) Your Dag may have an Case 3) You work at Small Company and don't have an admin team, you do it all yourself, so a Dag failing likely won't be cleared and the missed deadline is inevitable. I'd argue that Case 4) You realize something is wrong with the input and you manually mark the Dag as FAILED so the report doesn't get generated with bad data. Are you going to fix the data and re-trigger the Dag, or is that the end of the story? We don't know and shouldn't guess. Worth noting: this PR wouldn't cover this case either, which would introduce a difference in behavior between an automatic failure and a manual one. Overall, always short-circuiting is not the answer. I would propose three alternatives which are not mutually exclusive:
To be clear, I do agree that there is a real user need here, I'm not trying to kill this or dissuade you. I just want to make sure we solve it without breaking the existing use cases. It's one very valid use case, but that doesn't invalidate the others. That said, here's my suggestion: I'd also ask that the ((tagging @ramitkataria since he was also involved in the original discussion and may have thoughts.)) |
|
@ferruzzi thanks, this makes sense. I split the independent For this PR, I agree that always firing the Deadline Alert on failure is too broad as the default. I will rework this toward an opt-in The prune-hook and UI/manual-failure options also make sense to me, but I think those can be tracked as follow-up issues unless you think they should be included in this PR. |
Summary
DagRun.update_state()transitions a run toFAILED.Deadline.prune_deadlines()limited to successful DagRun cleanup.effects do not block DagRun finalization.
callback.data, so both triggerer and executor callback paths survive commit/reload.deadline.idas a string in callback context so async trigger kwargs can be stored and run by the triggerer.Closes: #60927
Details
Failed DagRuns currently leave pending deadlines untouched until the scheduler's expired-deadline loop reaches
deadline_time. This can delay the configured deadline alert by hours after the run has already failed.This change treats scheduler-driven DagRun failure as an immediate missed deadline for pending deadlines on that DagRun. It uses the same row-locking shape as the scheduler deadline loop and filters already-missed rows to avoid duplicate handling.
Because deadline alerting is a side effect of the DagRun state transition, this also catches and logs immediate deadline-handling failures instead of letting them escape
DagRun.update_state()and block the run from being marked failed.While validating the callback path, the executor callback path also needed top-level routing fields persisted in
callback.data; otherwise the scheduler could mark the callback pending but not enqueue it. The triggerer callback path needed the same JSON reassignment discipline and a string deadline id because trigger kwargs are serialized before the triggerer runs them.Manual/API/UI "mark failed" remains out of scope for this patch.
Tests
Result:
9 passed, 1 warning.Result:
All checks passed!Result:
4 files already formatted.Also tested manually in a real local Airflow daemon environment:
successand wrote a marker row.successand wrote a marker row.Final E2E rows: