Emit metric dag auto paused #69078

Open
takayoshi-makabe wants to merge 3 commits into apache:main from takayoshi-makabe:emit-metric-dag-auto-paused

@takayoshi-makabe

Emit dag.auto_paused metric when the scheduler automatically pauses a DAG after exceeding max_consecutive_failed_dag_runs. Platform teams can now alert on this event directly via StatsD/OpenTelemetry without polling the database or correlating failure metrics manually.

The metric includes dag_id and run_type tags via the existing stats_tags property on DagRun.
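For readers unfamiliar with the flow, here is a minimal, self-contained sketch of the behaviour described above. It is not Airflow's code: `FakeStats`, `FakeDagRun`, and `maybe_auto_pause` are hypothetical stand-ins, and the `>=` threshold comparison is an assumption made for illustration. Only the metric name `dag.auto_paused` and the `dag_id`/`run_type` tags come from this PR.

```python
class FakeStats:
    """Hypothetical in-memory stand-in for a StatsD/OpenTelemetry client."""

    def __init__(self):
        self.calls = []

    def incr(self, name, tags=None):
        # Record each increment so it can be inspected afterwards.
        self.calls.append((name, dict(tags or {})))


class FakeDagRun:
    """Hypothetical, simplified model; not Airflow's DagRun."""

    def __init__(self, dag_id, run_type, max_consecutive_failed_dag_runs):
        self.dag_id = dag_id
        self.run_type = run_type
        self.max_consecutive_failed_dag_runs = max_consecutive_failed_dag_runs
        self.is_paused = False

    @property
    def stats_tags(self):
        # Mirrors the two tags named in the PR description.
        return {"dag_id": self.dag_id, "run_type": self.run_type}

    def maybe_auto_pause(self, consecutive_failures, stats):
        limit = self.max_consecutive_failed_dag_runs
        # Assumption: a limit of 0 disables auto-pause, and the DAG is
        # paused once the failure count reaches the limit.
        if limit > 0 and consecutive_failures >= limit and not self.is_paused:
            self.is_paused = True
            stats.incr("dag.auto_paused", tags=self.stats_tags)


stats = FakeStats()
run = FakeDagRun("example_dag", "manual", max_consecutive_failed_dag_runs=3)
run.maybe_auto_pause(consecutive_failures=2, stats=stats)  # below the limit: no metric
run.maybe_auto_pause(consecutive_failures=3, stats=stats)  # limit reached: metric emitted
```

Because the counter is emitted only on the transition to paused, an alert on `dag.auto_paused` fires once per auto-pause event in this sketch rather than on every subsequent failure.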

closes: #69004


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Claude Code



Comment on lines 950 to 952
)
stats.incr("dag.auto_paused", tags=self.stats_tags)
session.add(

Author

  • dag.* namespace fits because auto-pause is a DagModel state change, not a single-run event
    (dagrun.*).
  • self.stats_tags already contains dag_id and run_type, consistent with other metric call sites
    in this file.

Comment on lines +1152 to +1155
mock_stats_incr.assert_any_call(
    "dag.auto_paused",
    tags={"dag_id": dag_id, "run_type": DagRunType.MANUAL},
)

Author

This uses assert_any_call rather than assert_called_once_with because update_state() makes several stats.incr calls (task state changes, duration, etc.) before it reaches the auto-pause check.
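To illustrate the difference, here is a small standalone sketch using only `unittest.mock`. The first metric name, `some.other_metric`, is a hypothetical placeholder for whatever counters are emitted earlier; it is not taken from Airflow.

```python
from unittest import mock

stats_incr = mock.Mock()

# Simulate a method that emits more than one counter before the one under test.
stats_incr("some.other_metric", tags={"dag_id": "example_dag"})  # hypothetical earlier metric
stats_incr("dag.auto_paused", tags={"dag_id": "example_dag", "run_type": "manual"})

expected_tags = {"dag_id": "example_dag", "run_type": "manual"}

# Passes: at least one recorded call matches these arguments.
stats_incr.assert_any_call("dag.auto_paused", tags=expected_tags)

# Fails: the mock was called twice, so "called once" cannot hold.
try:
    stats_incr.assert_called_once_with("dag.auto_paused", tags=expected_tags)
except AssertionError:
    once_check_failed = True
else:
    once_check_failed = False
```

`assert_any_call` checks only that a matching call exists somewhere in the call history, which keeps the test independent of how many other counters the method emits.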

@henry3260

Contributor

CI is failing.

The metric was emitted in code but missing from the registry, causing the check-metrics-synced-with-registry static check to fail.

@takayoshi-makabe

Author

@henry3260

I fixed it in da51ff1.

The command below runs successfully:

uv run python scripts/ci/prek/check_metrics_synced_with_the_registry.py airflow-core/src/airflow/models/dagrun.py

Could you please rerun the GitHub Actions?

@takayoshi-makabe

Author

@henry3260
Sorry about that. The CI passed, but I updated the branch afterward. Could you please run it again?

Development

Successfully merging this pull request may close these issues.

Emit metrics when Airflow auto pauses a DAG

2 participants