[CRCR] Add EventBridge sweeper configuration and interval variable #806
can-gaa-hou wants to merge 4 commits into
atalman left a comment
Potential bug with `rate()` singular/plural grammar

If `sweeper_interval_minutes` is set to 1, the schedule expression becomes `rate(1 minutes)`, which is invalid: AWS requires `rate(1 minute)` (singular). Consider either adding a validation constraint:

```hcl
variable "sweeper_interval_minutes" {
  ...
  validation {
    condition     = var.sweeper_interval_minutes >= 2
    error_message = "Must be >= 2 to avoid rate() singular/plural grammar issue."
  }
}
```

Or handling it with a conditional in the expression:

```hcl
schedule_expression = "rate(${var.sweeper_interval_minutes} ${var.sweeper_interval_minutes == 1 ? "minute" : "minutes"})"
```
Thanks for the review. In most cases there is no need to sweep every minute, since there are not many zombie workflow records to clean. I added a validation constraint that requires an interval of 2 minutes or more, and the default interval is 10 minutes. WDYT @atalman
Hi @atalman @fffrog, this PR is ready to be merged since pytorch/test-infra#8198 has been merged. PTAL, thanks!
Hi @atalman,
Overview
Adds an EventBridge scheduled rule that triggers the existing CRCR `callback` lambda every 10 minutes (configurable via the new `sweeper_interval_minutes` variable) to drive the active "zombie job" sweeper.

- `aws_cloudwatch_event_rule.sweeper`: `rate(10 minutes)` schedule.
- `aws_cloudwatch_event_target.sweeper`: targets the callback lambda with a fixed payload `{"source": "crcr.sweeper"}`, which the handler routes on to enter the cleanup branch.
- `aws_lambda_permission.sweeper_invoke`: allows `events.amazonaws.com` to invoke the callback lambda, scoped to this rule's ARN.
- `sweeper_interval_minutes` variable (default `10`).

Background
Redis keyspace expiration events aren't well-suited for stateless Lambdas, so we use an EventBridge cron + a Redis ZSET (scored by expected timeout) instead. In-progress jobs are tracked in the ZSET and removed on normal completion; anything left past its expiry is a zombie. On each tick the sweeper reaps zombies — marking them as timed out in HUD and on GitHub, then clearing them from the cache.
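
To make the sweep concrete, here is a minimal sketch of the pattern described above, not the actual handler from pytorch/test-infra#8198. It assumes a hypothetical in-memory `InMemoryZSet` in place of Redis (a real deployment would presumably use `ZADD`, `ZRANGEBYSCORE`, and `ZREM`), and a hypothetical `mark_timed_out` hook standing in for the HUD and GitHub updates. Only the `{"source": "crcr.sweeper"}` payload is taken from this PR.

```python
import time

SWEEPER_SOURCE = "crcr.sweeper"  # payload value sent by the EventBridge target


class InMemoryZSet:
    """Hypothetical stand-in for a Redis sorted set: member -> score (expiry time)."""

    def __init__(self):
        self._scores = {}

    def add(self, member, score):
        self._scores[member] = score

    def remove(self, member):
        self._scores.pop(member, None)

    def range_by_score(self, low, high):
        # Members whose score lies in [low, high], ordered by score, then name.
        hits = [(s, m) for m, s in self._scores.items() if low <= s <= high]
        return [m for _, m in sorted(hits)]


def sweep(zset, now, mark_timed_out):
    """Reap every job whose expected timeout is at or before `now`."""
    zombies = zset.range_by_score(float("-inf"), now)
    for job_id in zombies:
        mark_timed_out(job_id)  # hypothetical hook: HUD / GitHub status update
        zset.remove(job_id)     # clear the zombie from the cache
    return zombies


def handler(event, zset, now=None, mark_timed_out=lambda job_id: None):
    """Route the scheduled sweeper payload to the cleanup branch."""
    if event.get("source") == SWEEPER_SOURCE:
        now = time.time() if now is None else now
        return {"swept": sweep(zset, now, mark_timed_out)}
    return {"swept": None}  # any other event would take the normal callback path
```

In this sketch a job is added to the set when it starts and removed when it completes normally, so whatever still has a score at or below the current time on a tick is treated as a zombie.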
Dependency / deploy order
This is the infra half of the feature. The callback lambda's routing + cleanup logic lives in pytorch/test-infra#8198 (`if event.get("source") == "crcr.sweeper"`). The `input` payload here matches that branch.

Refs: