fix: stop circuit breaker bypass via failure_count reset (AAP-78375) #1596
hsong-rh wants to merge 2 commits into
_detect_running_status() unconditionally reset failure_count to 0 on every STARTING→RUNNING transition. This prevented the ACTIVATION_MAX_RESTARTS_ON_FAILURE circuit breaker from ever tripping for activations that crash after reaching RUNNING status, causing infinite restart loops and cluster-wide resource exhaustion.

Remove the _reset_failure_count() call from _detect_running_status() and the now-dead method. failure_count is still properly reset by user-initiated actions (enable, project sync recovery).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
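The bypass described in this commit message can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the project's code: the `Activation` dataclass, `run_crash_cycles()`, and the "trip once the count exceeds the limit" comparison are all hypothetical stand-ins, and `MAX_RESTARTS_ON_FAILURE` merely mirrors the `ACTIVATION_MAX_RESTARTS_ON_FAILURE: 5` setting mentioned in the PR.

```python
from dataclasses import dataclass

# Stand-in for ACTIVATION_MAX_RESTARTS_ON_FAILURE (value taken from the PR text).
MAX_RESTARTS_ON_FAILURE = 5


@dataclass
class Activation:
    """Hypothetical stand-in for the real activation model."""

    failure_count: int = 0
    restarts: int = 0
    stopped: bool = False


def run_crash_cycles(activation, cycles, reset_on_running):
    """Simulate an activation that reaches RUNNING and then crashes each cycle."""
    for _ in range(cycles):
        if activation.stopped:
            break
        # STARTING -> RUNNING transition.
        if reset_on_running:
            activation.failure_count = 0  # the behavior this PR removes
        # The instance crashes some time after reaching RUNNING.
        activation.failure_count += 1
        # Assumed breaker rule: stop once failures exceed the limit.
        if activation.failure_count > MAX_RESTARTS_ON_FAILURE:
            activation.stopped = True
        else:
            activation.restarts += 1
    return activation


# 205 cycles mirrors the number of failed instances reported in the PR.
old = run_crash_cycles(Activation(), cycles=205, reset_on_running=True)
new = run_crash_cycles(Activation(), cycles=205, reset_on_running=False)

print(f"with reset:    restarts={old.restarts}, stopped={old.stopped}")
print(f"without reset: restarts={new.restarts}, stopped={new.stopped}")
```

With the reset in place the counter never climbs past 1, so the simulated activation restarts on every cycle; without it the breaker trips after the configured number of restarts.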
…ression

set_latest_instance_status() sets updated_at = models.functions.Now(), a SQL expression that remains in memory as a Now() object rather than a datetime. The subsequent _is_unresponsive() check then fails when it compares Now() < datetime. Previously, the @run_with_lock decorator on _reset_failure_count() incidentally called refresh_from_db(), which cleared the cached latest_instance. Add the refresh explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
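The failure mode in this commit message can be reproduced without Django. The following is a minimal sketch under assumptions: `SqlNow`, `Instance`, and `is_unresponsive()` are hypothetical stand-ins for the SQL `Now()` expression, the activation instance model, and the `_is_unresponsive()` check. It only shows why an unevaluated expression object cannot be ordered against a `datetime`, and why reloading the value fixes the comparison.

```python
from datetime import datetime, timedelta, timezone


class SqlNow:
    """Hypothetical stand-in for an unevaluated SQL Now() expression."""


class Instance:
    """Hypothetical stand-in for the activation instance model."""

    def __init__(self):
        self.updated_at = datetime.now(timezone.utc)

    def mark_updated(self):
        # Mimics assigning a database expression: until the row is reloaded,
        # the attribute holds the expression object, not a datetime.
        self.updated_at = SqlNow()

    def refresh_from_db(self):
        # Mimics reloading the row: the expression is replaced by a real
        # timestamp (here simply the current time).
        self.updated_at = datetime.now(timezone.utc)


def is_unresponsive(instance, timeout):
    """Return True if the instance has not been updated within `timeout`."""
    cutoff = datetime.now(timezone.utc) - timeout
    return instance.updated_at < cutoff


instance = Instance()
instance.mark_updated()

try:
    is_unresponsive(instance, timedelta(seconds=300))
    stale_compare_failed = False
except TypeError:
    # Ordering an arbitrary object against a datetime is not supported.
    stale_compare_failed = True

instance.refresh_from_db()
fresh_result = is_unresponsive(instance, timedelta(seconds=300))

print(stale_compare_failed, fresh_result)
```

The explicit refresh plays the role that the decorator's incidental `refresh_from_db()` call used to play before the method was removed.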
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
- Coverage   92.35%   92.35%   -0.01%
==========================================
  Files         244      244
  Lines       11214    11210       -4
==========================================
- Hits        10357    10353       -4
  Misses        857      857
    running_container_status_mock: MagicMock,
):
    """AAP-78375: failure_count must not reset on STARTING->RUNNING."""
    starting_activation.failure_count = 3
I think I understand the reason for the change. We don't want to restart the failure count if the activation starts --> runs --> 3 seconds later it falls over. What about cases where it runs for hours/days? Shouldn't we restart the count at that point?
@AlexSCorey A failure is a failure regardless of its uptime. If an activation fails 5 times over weeks, that's still a pattern worth stopping. The admin can always disable/enable to reset the counter if they know a failure was transient. What do you think?



https://redhat.atlassian.net/browse/AAP-78375
Summary
- _detect_running_status() unconditionally reset failure_count to 0 on every STARTING→RUNNING transition, preventing ACTIVATION_MAX_RESTARTS_ON_FAILURE from ever tripping for activations that crash after reaching RUNNING status
- Remove the _reset_failure_count() call from _detect_running_status() and the now-dead method (7 lines removed)
- Add test_monitor_preserves_failure_count_on_running_transition to test_manager.py

Details
When an activation instance survived long enough to receive its first heartbeat (transitioning from STARTING to RUNNING), the failure_count was reset to 0. If the activation then crashed (e.g., after 15 minutes due to cascading job failures or liveness probe timeouts), the counter started over from scratch. The circuit breaker threshold was never reached, causing infinite restart loops and cluster-wide resource exhaustion.

Customer impact: In Case 04460263, an activation reached 205 failed instances over two days with ACTIVATION_MAX_RESTARTS_ON_FAILURE: 5 active but ignored.

failure_count is still properly reset by user-initiated actions:
- enable (api/views/activation.py)
- project sync recovery (tasks/project.py)

Verification
Reproduced and verified on local podman pods:
- failure_count=4, watch restart cycle

Test plan
- test_monitor_preserves_failure_count_on_running_transition verifies failure_count is preserved during STARTING→RUNNING transition
- test_monitor_to_running_status still passes (only checks restart_count, unaffected)
- test_projects.py) still pass; those test the legitimate reset paths

Fixes: AAP-78375
🤖 Generated with Claude Code