fix: stop circuit breaker bypass via failure_count reset (AAP-78375) by hsong-rh · Pull Request #1596 · ansible/eda-server

hsong-rh · 2026-06-18T13:35:47Z

https://redhat.atlassian.net/browse/AAP-78375

Summary

Root cause: _detect_running_status() unconditionally reset failure_count to 0 on every STARTING→RUNNING transition, preventing ACTIVATION_MAX_RESTARTS_ON_FAILURE from ever tripping for activations that crash after reaching RUNNING status
Fix: Remove the _reset_failure_count() call from _detect_running_status() and the now-dead method (7 lines removed)
Test: Added test_monitor_preserves_failure_count_on_running_transition to test_manager.py

Details

When an activation instance survived long enough to receive its first heartbeat (transitioning from STARTING to RUNNING), the failure_count was reset to 0. If the activation then crashed (e.g., after 15 minutes due to cascading job failures or liveness probe timeouts), the counter started over from scratch. The circuit breaker threshold was never reached, causing infinite restart loops and cluster-wide resource exhaustion.

Customer impact: In Case 04460263, an activation reached 205 failed instances over two days with ACTIVATION_MAX_RESTARTS_ON_FAILURE: 5 active but ignored.

failure_count is still properly reset by user-initiated actions:

User enables a disabled activation (api/views/activation.py)
Project sync recovers a failed activation (tasks/project.py)

Verification

Reproduced and verified on local podman pods:

Scenario	Before fix	After fix
Set `failure_count=4`, watch restart cycle	Reset to 0 on STARTING→RUNNING	Preserved at 4 through multiple cycles

Test plan

New test: test_monitor_preserves_failure_count_on_running_transition verifies failure_count is preserved during STARTING→RUNNING transition
Existing test_monitor_to_running_status still passes (only checks restart_count, unaffected)
Project sync tests (test_projects.py) still pass — those test the legitimate reset paths
CI: full test suite

Fixes: AAP-78375

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed incorrect reset of failure counts when activations transition to a running state.
Tests
- Added integration test to verify failure count preservation during activation state transitions.

_detect_running_status() unconditionally reset failure_count to 0 on every STARTING→RUNNING transition. This prevented the ACTIVATION_MAX_RESTARTS_ON_FAILURE circuit breaker from ever tripping for activations that crash after reaching RUNNING status, causing infinite restart loops and cluster-wide resource exhaustion. Remove the _reset_failure_count() call from _detect_running_status() and the now-dead method. failure_count is still properly reset by user-initiated actions (enable, project sync recovery). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-18T13:36:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 40ac1cf9-2a2c-49ca-a0ff-15916f7a03e9

📥 Commits

Reviewing files that changed from the base of the PR and between cf97072 and dd16491.

📒 Files selected for processing (1)

src/aap_eda/services/activation/activation_manager.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/aap_eda/services/activation/activation_manager.py

📝 Walkthrough

Walkthrough

_reset_failure_count is removed from ActivationManager. Its call site in _detect_running_status is replaced with a transaction.atomic() block that sets the activation and latest instance status to RUNNING without zeroing failure_count. A new integration test confirms the count is preserved after the transition.

Changes

Preserve failure_count on RUNNING transition

Layer / File(s)	Summary
Remove `_reset_failure_count` and update `_detect_running_status` `src/aap_eda/services/activation/activation_manager.py`	Deletes the private `_reset_failure_count` helper that zeroed `db_instance.failure_count` under `run_with_lock`. In `_detect_running_status`, the call to that helper is replaced with a `transaction.atomic()` block that writes `ActivationStatus.RUNNING` to both the activation and latest instance status, leaving `failure_count` untouched.
Integration test for preserved `failure_count` `tests/integration/services/activation/test_manager.py`	Adds `test_monitor_preserves_failure_count_on_running_transition`, which sets a non-zero `failure_count` on a `STARTING` activation, mocks the container engine to return a RUNNING status, calls `monitor()`, and asserts both `RUNNING` status and the unchanged `failure_count`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically identifies the main change: removing the failure_count reset that was bypassing the circuit breaker mechanism.
Description check	✅ Passed	The description comprehensively covers all required template sections: root cause, fix approach, testing strategy, customer impact, and verification results.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ression set_latest_instance_status() sets updated_at = models.functions.Now(), a SQL expression that remains in-memory as a Now() object rather than a datetime. The subsequent _is_unresponsive() check fails comparing Now() < datetime. Previously _reset_failure_count()'s @run_with_lock decorator incidentally called refresh_from_db() which cleared the cached latest_instance. Add the refresh explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-18T15:15:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.35%. Comparing base (d2f0c95) to head (dd16491).

@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
- Coverage   92.35%   92.35%   -0.01%     
==========================================
  Files         244      244              
  Lines       11214    11210       -4     
==========================================
- Hits        10357    10353       -4     
  Misses        857      857

Flag	Coverage Δ
unit-int-tests-3.11	`92.35% <100.00%> (-0.01%)`	⬇️
unit-int-tests-3.12	`92.35% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
.../aap_eda/services/activation/activation_manager.py	`62.26% <100.00%> (-0.31%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sonarqubecloud · 2026-06-18T15:17:16Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

AlexSCorey · 2026-06-18T15:29:16Z

+    running_container_status_mock: MagicMock,
+):
+    """AAP-78375: failure_count must not reset on STARTING->RUNNING."""
+    starting_activation.failure_count = 3


I think I understand the reason for the change. We don't want to restart the failure count if the activation starts --> runs --> 3 seconds later it falls over. What about cases where it runs for hours/days? Shouldn't we restart the count at that point?

@AlexSCorey A failure is a failure regardless of its uptime. If an activation fails 5 times over weeks, that's still a pattern worth stopping. The admin can always disable/enable to reset the counter if they know a failure was transient. What do you think?

hsong-rh requested a review from a team as a code owner June 18, 2026 13:35

AlexSCorey reviewed Jun 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: stop circuit breaker bypass via failure_count reset (AAP-78375)#1596

fix: stop circuit breaker bypass via failure_count reset (AAP-78375)#1596
hsong-rh wants to merge 2 commits into
ansible:mainfrom
hsong-rh:fix/aap-78375-circuit-breaker-bypass

hsong-rh commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

codecov-commenter commented Jun 18, 2026 •

edited

Loading

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Uh oh!

AlexSCorey Jun 18, 2026

Uh oh!

hsong-rh Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

hsong-rh commented Jun 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Verification

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

codecov-commenter commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Quality Gate passed

Uh oh!

AlexSCorey Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

hsong-rh Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hsong-rh commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

codecov-commenter commented Jun 18, 2026 •

edited

Loading