[CRCR] Add EventBridge sweeper configuration and interval variable #806

Open
can-gaa-hou wants to merge 4 commits into main from eventbridge

@can-gaa-hou can-gaa-hou commented Jun 23, 2026

Overview

Adds an EventBridge scheduled rule that triggers the existing CRCR callback lambda every 10 minutes (configurable via the new sweeper_interval_minutes variable) to drive the active "zombie job" sweeper.

  • aws_cloudwatch_event_rule.sweeper with a rate(10 minutes) schedule.
  • aws_cloudwatch_event_target.sweeper — targets the callback lambda with a fixed payload {"source": "crcr.sweeper"}, which the handler routes on to enter the cleanup branch.
  • aws_lambda_permission.sweeper_invoke — allows events.amazonaws.com to invoke the callback lambda, scoped to this rule's ARN.
  • New sweeper_interval_minutes variable (default 10).

Background

Redis keyspace expiration events aren't well suited to stateless Lambdas, so we use an EventBridge scheduled rule plus a Redis ZSET (scored by expected timeout) instead. In-progress jobs are tracked in the ZSET and removed on normal completion; anything left past its expiry is a zombie. On each tick the sweeper reaps zombies: it marks them as timed out in HUD and on GitHub, then clears them from the cache.
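For reviewers, the ZSET bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in pytorch/test-infra#8198: the key name `crcr:in_progress`, the function names, and the `mark_timed_out` callback are all hypothetical, and `InMemoryZSet` is a stand-in that only mirrors the redis-py method names (`zadd`, `zrangebyscore`, `zrem`) so the sketch runs without a Redis server.

```python
import time


class InMemoryZSet:
    """Stand-in for a Redis client; mirrors the redis-py method names used below."""

    def __init__(self):
        self._sets = {}

    def zadd(self, name, mapping):
        self._sets.setdefault(name, {}).update(mapping)

    def zrangebyscore(self, name, min, max):
        members = self._sets.get(name, {})
        hits = [(score, m) for m, score in members.items() if min <= score <= max]
        return [m for _, m in sorted(hits)]

    def zrem(self, name, *values):
        members = self._sets.get(name, {})
        return sum(1 for v in values if members.pop(v, None) is not None)


ZSET_KEY = "crcr:in_progress"  # hypothetical key name


def track_job(client, job_id, timeout_seconds, now=None):
    """Record a job, scored by the time at which it is expected to have finished."""
    now = time.time() if now is None else now
    client.zadd(ZSET_KEY, {job_id: now + timeout_seconds})


def complete_job(client, job_id):
    """Normal completion: the job leaves the ZSET and is never swept."""
    client.zrem(ZSET_KEY, job_id)


def sweep(client, mark_timed_out, now=None):
    """Reap every job whose deadline has passed; returns the reaped job ids."""
    now = time.time() if now is None else now
    zombies = client.zrangebyscore(ZSET_KEY, 0, now)
    for job_id in zombies:
        mark_timed_out(job_id)  # hypothetical hook for the HUD / GitHub updates
        client.zrem(ZSET_KEY, job_id)
    return zombies
```

With a real redis-py client the same three calls should apply, although members come back as bytes unless the client is created with `decode_responses=True`.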

Dependency / deploy order

This is the infra half of the feature. The callback lambda's routing + cleanup logic lives in pytorch/test-infra#8198 (if event.get("source") == "crcr.sweeper"). The input payload here matches that branch.
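To make the contract between the two halves concrete, here is a rough sketch of how a handler might route on the fixed payload. Only the `event.get("source") == "crcr.sweeper"` check comes from this PR; `run_sweeper`, `handle_callback`, and their return values are hypothetical placeholders, and the real handler in pytorch/test-infra#8198 may be structured differently.

```python
SWEEPER_SOURCE = "crcr.sweeper"  # must match the fixed input on the EventBridge target


def run_sweeper():
    # Hypothetical placeholder for the cleanup branch.
    return {"route": "sweeper"}


def handle_callback(event):
    # Hypothetical placeholder for the regular callback path.
    return {"route": "callback"}


def handler(event, context=None):
    """Send sweeper ticks to the cleanup branch and everything else to the callback path."""
    if isinstance(event, dict) and event.get("source") == SWEEPER_SOURCE:
        return run_sweeper()
    return handle_callback(event)
```

Because the target sends a constant input, the lambda receives exactly `{"source": "crcr.sweeper"}` rather than the default scheduled-event envelope, so an exact string comparison is enough here.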

Refs:

@atalman atalman left a comment

Potential bug with rate() singular/plural grammar

If sweeper_interval_minutes is set to 1, the schedule expression becomes rate(1 minutes), which is invalid — AWS requires rate(1 minute) (singular). Consider either adding a validation constraint:

variable "sweeper_interval_minutes" {
  ...
  validation {
    condition     = var.sweeper_interval_minutes >= 2
    error_message = "Must be >= 2 to avoid rate() singular/plural grammar issue."
  }
}

Or handling it with a conditional in the expression:

schedule_expression = "rate(${var.sweeper_interval_minutes} ${var.sweeper_interval_minutes == 1 ? "minute" : "minutes"})"
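The singular/plural rule is easy to check outside Terraform. The helper below is a hypothetical illustration of the same conditional, not part of this PR; it only builds the string and does not call AWS.

```python
def rate_expression(interval_minutes: int) -> str:
    """Build an EventBridge rate() expression with the correct unit grammar."""
    if interval_minutes < 1:
        raise ValueError("interval must be a positive integer")
    # A value of exactly 1 needs the singular unit; anything larger needs the plural.
    unit = "minute" if interval_minutes == 1 else "minutes"
    return f"rate({interval_minutes} {unit})"
```

For the default interval this yields `rate(10 minutes)`, and for an interval of 1 it yields `rate(1 minute)`.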

@can-gaa-hou

If sweeper_interval_minutes is set to 1, the schedule expression becomes rate(1 minutes), which is invalid

Thanks for the review. In most cases there is no need to sweep every minute, since there are few zombie workflow records to clean. I added a validation constraint requiring an interval of at least 2 minutes, and in practice the interval will be the 10-minute default or longer. WDYT @atalman?

@fffrog fffrog left a comment

Great work! Thank you.

@can-gaa-hou can-gaa-hou marked this pull request as ready for review June 26, 2026 08:28

can-gaa-hou commented Jun 27, 2026

Hi @atalman @fffrog, this PR is ready to be merged since pytorch/test-infra#8198 has been merged. PTAL, Thanks!

@can-gaa-hou

Hi @atalman, v20260629-172623 already contains the code we need. We don't need to change the tag number in Terrafile. This PR is ready to merge. Thanks!

3 participants