record circuit-breaker success only on the actual hedged winner#14
record circuit-breaker success only on the actual hedged winner#14HrachShah wants to merge 1 commit into
Conversation
The old execute_hedged() looped every involved provider's circuit breaker and called record_success() on all of them, including the loser(s). The loser was either cancelled mid-flight or completed after the winner had already returned. Either way the loser's provider did NOT serve the user,
There was a problem hiding this comment.
Sorry @HrachShah, you have reached your weekly rate limit of 500000 diff characters.
Please try again later or upgrade to continue using Sourcery
📝 WalkthroughWalkthrough
ChangesWinner-only circuit breaker credit in hedged execution
Sequence Diagram(s)sequenceDiagram
rect rgba(173, 216, 230, 0.5)
note over execute_hedged,LoserCB: Success path
end
participant Caller
participant execute_hedged
participant hedged_execute
participant WinnerCB as winner CB
participant LoserCB as loser CB
Caller->>execute_hedged: hedged request
execute_hedged->>hedged_execute: providers, request
hedged_execute-->>execute_hedged: HedgedResult(response, winner_name, loser_names)
execute_hedged->>WinnerCB: record_success()
execute_hedged->>LoserCB: record_failure(None)
execute_hedged-->>Caller: result.response
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unit/test_hedging_winner_only_success.py (1)
62-63: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDocstring contradicts the actual assertion.
The docstring says B's circuit "stays OPEN", but the test sleeps past
recovery_timeout(auto-transition to HALF_OPEN) and only asserts!= CLOSED(Lines 100-102). Align the wording with the HALF_OPEN expectation to avoid confusion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_hedging_winner_only_success.py` around lines 62 - 63, The docstring at the beginning of the test function states that B's circuit "stays OPEN", but the test code sleeps past recovery_timeout (which causes an automatic state transition) and the assertion on lines 100-102 only verifies the state is not CLOSED. Update the docstring to accurately reflect that B's circuit transitions to HALF_OPEN due to the recovery timeout, aligning the documentation with the actual behavior being tested by the assertions.freerelay/core/execution/hedging.py (1)
115-124: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRedundant drain loop over
donelosers.These
loser_providersare taken from thedoneset withexc is None, and their results were already retrieved viatask.exception()in the loop above. They are notpending, so the "Task was destroyed but it is pending!" rationale doesn't apply, and the nestedO(losers × done)scan plust.result()is a no-op for already-completed successful tasks.♻️ Suggested simplification
if winner_task is not None: logger.info("Hedged winner: %s", winner_provider.name) - # Drain any leftover successful losers so their task objects - # aren't GC'd in a 'pending result' state — that's a common - # asyncio footgun that produces "Task was destroyed but it is - # pending!" warnings at shutdown. - for loser in loser_providers: - for t in done: - if tasks.get(t) is loser: - with contextlib.suppress(asyncio.CancelledError, Exception): - t.result() - break return HedgedResult( response=winner_task.result(), winner_name=winner_provider.name, loser_names=tuple(p.name for p in loser_providers), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@freerelay/core/execution/hedging.py` around lines 115 - 124, The nested loop block iterating over loser_providers and done to call t.result() on already-completed successful tasks is redundant. Since these successful losers had their results already retrieved via task.exception() in the preceding loop above and are not pending, this drain serves no purpose while creating unnecessary O(losers × done) complexity. Remove the entire nested loop block (the for loser in loser_providers loop through the break statement) that scans the done set looking for matching task objects and calls t.result(), as it provides no functional benefit for tasks that have already completed successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@freerelay/core/execution/hedging.py`:
- Around line 88-91: In the pending task cancellation loop starting with `for
task in pending:`, the current code only suppresses `asyncio.CancelledError`,
but if a pending task raises a non-CancelledError exception after cancellation,
it will propagate and escape the function before the winner is determined.
Instead of only suppressing CancelledError in the `await task` statement,
capture and log any unexpected exceptions that occur during the cancel/drain
phase, then suppress them so the function can continue to determine the winner.
- Around line 100-129: The HedgedResult class needs to be extended to track
failed losers (providers that completed with exceptions) along with their
exceptions, since currently only successful losers are included in loser_names
while failed losers are silently dropped from the errors list when a winner
exists. Modify the HedgedResult dataclass to add a new field for failed losers
with their exceptions, then in the hedging logic around lines 100-129, populate
this field by tracking which loser providers had exceptions, and include this
data when returning the HedgedResult. Finally, update the execute_hedged
function in executor.py to iterate over these failed losers and record their
failures in the circuit breaker feedback to prevent losing track of provider
degradation.
In `@tests/unit/test_hedging_winner_only_success.py`:
- Around line 91-93: The assertion message in the test exceeds the 88-character
line limit (currently 99 characters on line 92). Split the f-string message
across multiple lines to comply with the linting requirement. You can either
wrap the string by breaking it into concatenated parts or restructure how the
assertion message is formatted to keep each line under 88 characters while
maintaining the readability and meaning of the assertion about a_breaker.state
and CircuitState.CLOSED.
---
Nitpick comments:
In `@freerelay/core/execution/hedging.py`:
- Around line 115-124: The nested loop block iterating over loser_providers and
done to call t.result() on already-completed successful tasks is redundant.
Since these successful losers had their results already retrieved via
task.exception() in the preceding loop above and are not pending, this drain
serves no purpose while creating unnecessary O(losers × done) complexity. Remove
the entire nested loop block (the for loser in loser_providers loop through the
break statement) that scans the done set looking for matching task objects and
calls t.result(), as it provides no functional benefit for tasks that have
already completed successfully.
In `@tests/unit/test_hedging_winner_only_success.py`:
- Around line 62-63: The docstring at the beginning of the test function states
that B's circuit "stays OPEN", but the test code sleeps past recovery_timeout
(which causes an automatic state transition) and the assertion on lines 100-102
only verifies the state is not CLOSED. Update the docstring to accurately
reflect that B's circuit transitions to HALF_OPEN due to the recovery timeout,
aligning the documentation with the actual behavior being tested by the
assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ef72af54-6eab-44fc-be71-2dd4c3c4517a
📒 Files selected for processing (3)
freerelay/core/execution/executor.pyfreerelay/core/execution/hedging.pytests/unit/test_hedging_winner_only_success.py
| for task in pending: | ||
| task.cancel() | ||
| with contextlib.suppress(asyncio.CancelledError, Exception): | ||
| with contextlib.suppress(asyncio.CancelledError): | ||
| await task |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Non-CancelledError from a cancelled task can escape and mask the winner.
If a pending task ignores/suppresses cancellation and then raises a non-CancelledError exception, await task re-raises it here, propagating out of hedged_execute before the winner is even determined — discarding a valid winning response. The comment's stated scenario (a task that "raised before we got here") can't apply to a pending task, since a task that already raised is in done, not pending. Consider logging and suppressing unexpected errors from the cancel/drain phase instead of letting them propagate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@freerelay/core/execution/hedging.py` around lines 88 - 91, In the pending
task cancellation loop starting with `for task in pending:`, the current code
only suppresses `asyncio.CancelledError`, but if a pending task raises a
non-CancelledError exception after cancellation, it will propagate and escape
the function before the winner is determined. Instead of only suppressing
CancelledError in the `await task` statement, capture and log any unexpected
exceptions that occur during the cancel/drain phase, then suppress them so the
function can continue to determine the winner.
| loser_providers: list[BaseProvider] = [] | ||
| errors: list[BaseException] = [] | ||
|
|
||
| for task in done: | ||
| if task.exception() is None: | ||
| winner = tasks[task] | ||
| logger.info("Hedged winner: %s", winner.name) | ||
| # Cancel any remaining done tasks that weren't checked yet | ||
| return task.result() | ||
| errors.append(task.exception()) # type: ignore[arg-type] | ||
| exc = task.exception() | ||
| if exc is None and winner_task is None: | ||
| winner_task = task | ||
| winner_provider = tasks[task] | ||
| elif exc is None: | ||
| loser_providers.append(tasks[task]) | ||
| else: | ||
| errors.append(exc) | ||
|
|
||
| if winner_task is not None: | ||
| logger.info("Hedged winner: %s", winner_provider.name) | ||
| # Drain any leftover successful losers so their task objects | ||
| # aren't GC'd in a 'pending result' state — that's a common | ||
| # asyncio footgun that produces "Task was destroyed but it is | ||
| # pending!" warnings at shutdown. | ||
| for loser in loser_providers: | ||
| for t in done: | ||
| if tasks.get(t) is loser: | ||
| with contextlib.suppress(asyncio.CancelledError, Exception): | ||
| t.result() | ||
| break | ||
| return HedgedResult( | ||
| response=winner_task.result(), | ||
| winner_name=winner_provider.name, | ||
| loser_names=tuple(p.name for p in loser_providers), | ||
| ) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm executor is the only consumer of hedged_execute / HedgedResult
rg -nP '\bhedged_execute\s*\(' --type=py
rg -nP '\bHedgedResult\b' --type=pyRepository: HrachShah/FreeRelay
Length of output: 157
🏁 Script executed:
#!/bin/bash
# First, find the hedging.py file and explore the directory structure
find . -name "hedging.py" -type f 2>/dev/null | head -20
ls -la freerelay/core/execution/ 2>/dev/null || echo "Path not found, searching for execution directory"
find . -type d -name "execution" 2>/dev/null | head -10Repository: HrachShah/FreeRelay
Length of output: 794
🏁 Script executed:
#!/bin/bash
# Search for files that might contain HedgedResult or hedged_execute
rg "HedgedResult|hedged_execute" --type py -l 2>/dev/null | head -20Repository: HrachShah/FreeRelay
Length of output: 157
🏁 Script executed:
#!/bin/bash
# Broader search to understand project structure
git ls-files | grep -E "(hedging|executor)" | head -20Repository: HrachShah/FreeRelay
Length of output: 339
🏁 Script executed:
#!/bin/bash
# Read hedging.py to verify the code at lines 100-129 and see HedgedResult definition
cat -n freerelay/core/execution/hedging.pyRepository: HrachShah/FreeRelay
Length of output: 5581
🏁 Script executed:
#!/bin/bash
# Read executor.py to verify how HedgedResult is used
cat -n freerelay/core/execution/executor.pyRepository: HrachShah/FreeRelay
Length of output: 8380
🏁 Script executed:
#!/bin/bash
# Verify if there are any tests that show the intended behavior
cat -n tests/unit/test_hedging_winner_only_success.pyRepository: HrachShah/FreeRelay
Length of output: 5512
Failed loser providers are silently dropped when a winner exists, preventing their circuit breakers from recording the failure.
When a winner completes successfully, loser_providers only collects done tasks where exc is None (line 109 of hedging.py); a loser that completed with an exception is pushed into the errors list (line 111), which is consumed only when no winner is found (line 133). This means a co-running loser that returned a real error has its exception swallowed and is not included in loser_names (line 128).
In executor.py (lines 171–179), execute_hedged iterates only over result.loser_names to record circuit breaker feedback. Losers that failed are not in loser_names, so their circuit breakers never observe their failure. A struggling provider that errors while another provider wins will have its failure silently discarded, breaking the circuit breaker's ability to track degradation.
Consider exposing failed losers (with their status/exception) on HedgedResult so the executor can record a real failure for them and update their circuit breakers accordingly.
🧰 Tools
🪛 GitHub Actions: CI / 0_typecheck.txt
[error] 114-114: mypy error: Item "None" of "BaseProvider | None" has no attribute "name" [union-attr]
[error] 127-127: mypy error: Item "None" of "BaseProvider | None" has no attribute "name" [union-attr]
🪛 GitHub Actions: CI / typecheck
[error] 114-114: mypy: Item "None" of "BaseProvider | None" has no attribute "name" [union-attr]
[error] 127-127: mypy: Item "None" of "BaseProvider | None" has no attribute "name" [union-attr]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@freerelay/core/execution/hedging.py` around lines 100 - 129, The HedgedResult
class needs to be extended to track failed losers (providers that completed with
exceptions) along with their exceptions, since currently only successful losers
are included in loser_names while failed losers are silently dropped from the
errors list when a winner exists. Modify the HedgedResult dataclass to add a new
field for failed losers with their exceptions, then in the hedging logic around
lines 100-129, populate this field by tracking which loser providers had
exceptions, and include this data when returning the HedgedResult. Finally,
update the execute_hedged function in executor.py to iterate over these failed
losers and record their failures in the circuit breaker feedback to prevent
losing track of provider degradation.
| assert a_breaker.state == CircuitState.CLOSED, ( | ||
| f"winner A should have success recorded and circuit reset to CLOSED, got {a_breaker.state}" | ||
| ) |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix CI lint failure: line exceeds 88 chars (E501).
CI fails on Line 92 (99 > 88). Wrap the assertion message.
🧵 Proposed fix
- assert a_breaker.state == CircuitState.CLOSED, (
- f"winner A should have success recorded and circuit reset to CLOSED, got {a_breaker.state}"
- )
+ assert a_breaker.state == CircuitState.CLOSED, (
+ "winner A should have success recorded and circuit reset to CLOSED, "
+ f"got {a_breaker.state}"
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert a_breaker.state == CircuitState.CLOSED, ( | |
| f"winner A should have success recorded and circuit reset to CLOSED, got {a_breaker.state}" | |
| ) | |
| assert a_breaker.state == CircuitState.CLOSED, ( | |
| "winner A should have success recorded and circuit reset to CLOSED, " | |
| f"got {a_breaker.state}" | |
| ) |
🧰 Tools
🪛 GitHub Actions: CI / 1_lint.txt
[error] 92-92: E501 Line too long (99 > 88)
🪛 GitHub Actions: CI / lint
[error] 92-92: E501 Line too long (99 > 88)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_hedging_winner_only_success.py` around lines 91 - 93, The
assertion message in the test exceeds the 88-character line limit (currently 99
characters on line 92). Split the f-string message across multiple lines to
comply with the linting requirement. You can either wrap the string by breaking
it into concatenated parts or restructure how the assertion message is formatted
to keep each line under 88 characters while maintaining the readability and
meaning of the assertion about a_breaker.state and CircuitState.CLOSED.
Source: Pipeline failures
Summary
Executor.execute_hedgedwas callingcircuit.record_success()on every involved provider's circuit breaker after a hedged request finished. The loser provider — which was either cancelled mid-flight or completed after the winner had already returned — did NOT serve the user, so marking its circuit CLOSED was wrong. It let a degraded provider appear healthy, silently lengthened the time between failure and circuit-trip, and the next user request to that provider would go to a broken backend that hadn't been probed again.This change:
hedged_executereturn aHedgedResult(response, winner_name, loser_names)so the caller knows which provider actually won.execute_hedgednow records success on the winner's circuit and explicitly records failure (or treats the loser as still-pending) on the losers.contextlib.suppress(asyncio.CancelledError, Exception)swallowed real provider errors during cancel-and-drain.Task was destroyed but it is pending!warnings at shutdown.Tests
test_hedged_winner_records_success_only_on_winner: pre-opens both circuits with 3 failures, sleeps past the recovery window, runs a hedge where A is 10× faster than B, and asserts that A's circuit ends CLOSED with score 1.0 while B's circuit does NOT end CLOSED.test_hedged_all_fail_records_failure_on_both: all providers raise, both circuits stay CLOSED (threshold=5, only 1 failure each).Both pass. The full existing unit suite (85 tests) still passes.
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests