Skip to content

fix: stop circuit breaker bypass via failure_count reset (AAP-78375)#1596

Open
hsong-rh wants to merge 2 commits into
ansible:mainfrom
hsong-rh:fix/aap-78375-circuit-breaker-bypass
Open

fix: stop circuit breaker bypass via failure_count reset (AAP-78375)#1596
hsong-rh wants to merge 2 commits into
ansible:mainfrom
hsong-rh:fix/aap-78375-circuit-breaker-bypass

Conversation

@hsong-rh

@hsong-rh hsong-rh commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

https://redhat.atlassian.net/browse/AAP-78375

Summary

  • Root cause: _detect_running_status() unconditionally reset failure_count to 0 on every STARTING→RUNNING transition, preventing ACTIVATION_MAX_RESTARTS_ON_FAILURE from ever tripping for activations that crash after reaching RUNNING status
  • Fix: Remove the _reset_failure_count() call from _detect_running_status() and the now-dead method (7 lines removed)
  • Test: Added test_monitor_preserves_failure_count_on_running_transition to test_manager.py

Details

When an activation instance survived long enough to receive its first heartbeat (transitioning from STARTING to RUNNING), the failure_count was reset to 0. If the activation then crashed (e.g., after 15 minutes due to cascading job failures or liveness probe timeouts), the counter started over from scratch. The circuit breaker threshold was never reached, causing infinite restart loops and cluster-wide resource exhaustion.

Customer impact: In Case 04460263, an activation reached 205 failed instances over two days with ACTIVATION_MAX_RESTARTS_ON_FAILURE: 5 active but ignored.

failure_count is still properly reset by user-initiated actions:

  • User enables a disabled activation (api/views/activation.py)
  • Project sync recovers a failed activation (tasks/project.py)

Verification

Reproduced and verified on local podman pods:

Scenario Before fix After fix
Set failure_count=4, watch restart cycle Reset to 0 on STARTING→RUNNING Preserved at 4 through multiple cycles

Test plan

  • New test: test_monitor_preserves_failure_count_on_running_transition verifies failure_count is preserved during STARTING→RUNNING transition
  • Existing test_monitor_to_running_status still passes (only checks restart_count, unaffected)
  • Project sync tests (test_projects.py) still pass — those test the legitimate reset paths
  • CI: full test suite

Fixes: AAP-78375

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed incorrect reset of failure counts when activations transition to a running state.
  • Tests

    • Added integration test to verify failure count preservation during activation state transitions.

_detect_running_status() unconditionally reset failure_count to 0
on every STARTING→RUNNING transition. This prevented the
ACTIVATION_MAX_RESTARTS_ON_FAILURE circuit breaker from ever
tripping for activations that crash after reaching RUNNING status,
causing infinite restart loops and cluster-wide resource exhaustion.

Remove the _reset_failure_count() call from _detect_running_status()
and the now-dead method. failure_count is still properly reset by
user-initiated actions (enable, project sync recovery).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hsong-rh hsong-rh requested a review from a team as a code owner June 18, 2026 13:35
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 40ac1cf9-2a2c-49ca-a0ff-15916f7a03e9

📥 Commits

Reviewing files that changed from the base of the PR and between cf97072 and dd16491.

📒 Files selected for processing (1)
  • src/aap_eda/services/activation/activation_manager.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aap_eda/services/activation/activation_manager.py

📝 Walkthrough

Walkthrough

_reset_failure_count is removed from ActivationManager. Its call site in _detect_running_status is replaced with a transaction.atomic() block that sets the activation and latest instance status to RUNNING without zeroing failure_count. A new integration test confirms the count is preserved after the transition.

Changes

Preserve failure_count on RUNNING transition

Layer / File(s) Summary
Remove _reset_failure_count and update _detect_running_status
src/aap_eda/services/activation/activation_manager.py
Deletes the private _reset_failure_count helper that zeroed db_instance.failure_count under run_with_lock. In _detect_running_status, the call to that helper is replaced with a transaction.atomic() block that writes ActivationStatus.RUNNING to both the activation and latest instance status, leaving failure_count untouched.
Integration test for preserved failure_count
tests/integration/services/activation/test_manager.py
Adds test_monitor_preserves_failure_count_on_running_transition, which sets a non-zero failure_count on a STARTING activation, mocks the container engine to return a RUNNING status, calls monitor(), and asserts both RUNNING status and the unchanged failure_count.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: removing the failure_count reset that was bypassing the circuit breaker mechanism.
Description check ✅ Passed The description comprehensively covers all required template sections: root cause, fix approach, testing strategy, customer impact, and verification results.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…ression

set_latest_instance_status() sets updated_at = models.functions.Now(),
a SQL expression that remains in-memory as a Now() object rather than
a datetime. The subsequent _is_unresponsive() check fails comparing
Now() < datetime. Previously _reset_failure_count()'s @run_with_lock
decorator incidentally called refresh_from_db() which cleared the
cached latest_instance. Add the refresh explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.35%. Comparing base (d2f0c95) to head (dd16491).

@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
- Coverage   92.35%   92.35%   -0.01%     
==========================================
  Files         244      244              
  Lines       11214    11210       -4     
==========================================
- Hits        10357    10353       -4     
  Misses        857      857              
Flag Coverage Δ
unit-int-tests-3.11 92.35% <100.00%> (-0.01%) ⬇️
unit-int-tests-3.12 92.35% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../aap_eda/services/activation/activation_manager.py 62.26% <100.00%> (-0.31%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sonarqubecloud

Copy link
Copy Markdown

running_container_status_mock: MagicMock,
):
"""AAP-78375: failure_count must not reset on STARTING->RUNNING."""
starting_activation.failure_count = 3

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the reason for the change. We don't want to restart the failure count if the activation starts --> runs --> 3 seconds later it falls over. What about cases where it runs for hours/days? Shouldn't we restart the count at that point?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexSCorey A failure is a failure regardless of its uptime. If an activation fails 5 times over weeks, that's still a pattern worth stopping. The admin can always disable/enable to reset the counter if they know a failure was transient. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants