fix(queueserver): recover from partial bootstrap failures on env-open #65
Merged
Anthony Sligar (sligara7) merged 2 commits on Jun 12, 2026
The first env-open against an empty configuration-service registry inserts every device the worker knows about. The previous code used a bare asyncio.gather over those upserts. Two problems:

1. The first failure cancelled the sibling tasks, so the registry was left only partially populated.
2. On the next env-open the 'is empty?' gate saw a populated registry, skipped bootstrap, and the change feed never delivered the missing devices. The system stayed in a permanent partial state until an operator manually intervened.

Fix: move bootstrap into a small helper that runs gather with return_exceptions=True, identifies the names that failed, retries just those once, and raises ConfigServiceError naming the still-missing devices and their underlying exceptions if anything is still failing after the retry. The retry is bounded so a genuinely broken upstream fails loudly per the no-silent-fallback rule.

Tests in tests/manager/test_config_service.py:

- test_sync_propagates_bootstrap_failure -- updated for the new retry-then-raise shape (failure now needs to repeat on retry to escape, and the wrapping error is ConfigServiceError, not the raw ConfigServiceHTTPError from the underlying POST).
- test_sync_bootstrap_retries_partial_failure_and_succeeds (new)
- test_sync_bootstrap_raises_loudly_with_all_unrecoverable_failures (new; verifies the raised message names every still-missing device).
- test_sync_bootstrap_does_not_retry_when_first_attempt_succeeds (new sanity guard).

Full tests/manager/test_config_service.py + test_config_service_integration.py: 94 passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR hardens queueserver env-open synchronization with the configuration-service by preventing “partially bootstrapped” registries when the initial empty-registry bootstrap hits transient per-device failures.
Changes:
- Introduces `_bootstrap_with_retry()` to run per-device upserts with `asyncio.gather(..., return_exceptions=True)`, retry failed devices once, and raise a single `ConfigServiceError` enumerating remaining failures.
- Updates/adds unit tests to cover: partial-failure recovery, bounded retry behavior, and loud error reporting when failures persist.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| backend/queueserver_service/queueserver_service/manager/config_service.py | Adds bootstrap-with-retry helper and routes env-open bootstrap through it to avoid sibling-task cancellation and permanent partial registry states after transient failures. |
| backend/queueserver_service/tests/manager/test_config_service.py | Updates existing bootstrap-failure test and adds new tests for partial-failure retry success, bounded retries, and multi-failure loud errors. |
Three review-bot catches on the bootstrap-with-retry helper, all genuine:

1. Docstring said 'retry survivors once'; the code retries the names that failed, not the ones that succeeded. Reworded to 'retry failures once'.
2. The inner _attempt collected any BaseException as a per-device failure, including asyncio.CancelledError, KeyboardInterrupt, and SystemExit. That risked aggregating cancellation/shutdown into a ConfigServiceError and obscuring the real termination reason. Narrowed the failure type to Exception and explicitly re-raise the non-Exception BaseException subclasses so cancellation and process shutdown behave normally.
3. The test docstring claimed ConfigServiceError 'wraps the underlying exception', but the implementation only put exception details in the message, with no chaining via __cause__. Made the claim true: chain the first remaining failure as __cause__ on the raised ConfigServiceError. A debugger landing on the raise site now has the original traceback. Test updated to assert __cause__ is set to the underlying ConfigServiceHTTPError.

94 passed (target test files, unchanged count).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
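Review items 2 and 3 can be illustrated with a small sketch. It is not the code from this PR: `raise_if_still_failing` is a hypothetical name, and `ConfigServiceError` is redefined locally as a stand-in so the block runs on its own. It assumes `results` is the list returned by `asyncio.gather(..., return_exceptions=True)` on the final attempt.

```python
class ConfigServiceError(Exception):
    """Local stand-in for the service's error type (assumption for this sketch)."""


def raise_if_still_failing(names, results):
    """Classify the results of the final attempt, one result per device name."""
    failures = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            # Ordinary per-device failure: collect it for the aggregate error.
            failures[name] = result
        elif isinstance(result, BaseException):
            # Not an Exception subclass (e.g. asyncio.CancelledError,
            # KeyboardInterrupt, SystemExit): propagate it unchanged instead
            # of folding it into ConfigServiceError.
            raise result
    if failures:
        details = ", ".join(f"{name}: {exc!r}" for name, exc in failures.items())
        first = next(iter(failures.values()))
        # "raise ... from" sets __cause__, so the original traceback stays
        # reachable from the aggregate error.
        raise ConfigServiceError(f"devices still missing: {details}") from first
```

Checking `Exception` before `BaseException` matters here: every `Exception` is also a `BaseException`, so the narrower test has to come first for the two cases to be told apart.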
What
When the configuration-service registry is empty, the first env-open inserts every device the worker knows about. If any one of those inserts failed, the registry was left partially populated AND the next env-open never tried to fix it — the system stayed in a permanent partial state until an operator manually intervened.
Why it happened
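`sync_devices_on_env_open` used a bare `asyncio.gather` over the per-device upserts. Two problems:
1. The first failure cancelled the sibling tasks, so the registry was left only partially populated.
2. On the next env-open the 'is empty?' gate saw a populated registry, skipped bootstrap, and the change feed never delivered the missing devices.
Fix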
Move the bootstrap into a small helper that:
1. Runs `gather` with `return_exceptions=True` so survivors land instead of being cancelled.
2. Identifies the names that failed and retries just those once.
3. Raises `ConfigServiceError` naming every still-missing device and its underlying exception if anything is still failing after the retry.
The retry is bounded (one extra attempt) so a genuinely broken upstream surfaces immediately rather than looping. The loud raise upholds the no-silent-fallback rule.
Symmetric in shape with the env-open lock recovery that just landed in #59.
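The helper described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the merged implementation: `bootstrap_with_retry` and the `upsert` callable are hypothetical names, and `ConfigServiceError` is a local stand-in so the block is self-contained.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


class ConfigServiceError(Exception):
    """Local stand-in for the service's error type (assumption for this sketch)."""


async def bootstrap_with_retry(
    names: Iterable[str],
    upsert: Callable[[str], Awaitable[None]],
) -> None:
    """Upsert every device, retry the failures once, raise if any still fail."""

    async def attempt(batch: list[str]) -> dict[str, Exception]:
        # With return_exceptions=True each slot holds either the return value
        # or the exception that was raised, so every upsert gets to finish.
        results = await asyncio.gather(
            *(upsert(name) for name in batch), return_exceptions=True
        )
        failures: dict[str, Exception] = {}
        for name, result in zip(batch, results):
            if isinstance(result, Exception):
                failures[name] = result
            elif isinstance(result, BaseException):
                raise result  # cancellation/shutdown is not a device failure
        return failures

    failed = await attempt(list(names))
    if failed:
        # Bounded retry: exactly one more attempt, only for the failed names.
        failed = await attempt(list(failed))
    if failed:
        details = ", ".join(f"{name}: {exc!r}" for name, exc in failed.items())
        raise ConfigServiceError(f"bootstrap still failing after retry: {details}")
```

A caller would await it once during env-open, for example `await bootstrap_with_retry(device_names, client_upsert)`, where both arguments are placeholders for whatever the worker actually provides.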
Tests
In `tests/manager/test_config_service.py`:
- `test_sync_propagates_bootstrap_failure`: the failure now has to repeat on retry to escape, and the wrapping error is `ConfigServiceError`, not the raw `ConfigServiceHTTPError` from the underlying POST.
- `test_sync_bootstrap_retries_partial_failure_and_succeeds`: m1 succeeds, m2 fails-then-succeeds; no exception escapes, both devices in the registry, cursor read happens.
- `test_sync_bootstrap_raises_loudly_with_all_unrecoverable_failures`: both devices fail both attempts; raised message names every still-missing device.
- `test_sync_bootstrap_does_not_retry_when_first_attempt_succeeds`: sanity guard against the retry path being entered unnecessarily.
Verified the three new/updated tests fail without the fix and pass with it.
Full `tests/manager/test_config_service.py` + `test_config_service_integration.py`: 94 passed.
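For readers who want to reproduce the "fails-then-succeeds" scenario, one way to script it is a small test double like the one below. This is a hypothetical sketch: `ScriptedUpsert` and `demo` are invented for illustration, and the real tests may mock the HTTP layer differently.

```python
import asyncio
from collections import defaultdict


class ScriptedUpsert:
    """Hypothetical test double: each device name gets a script of outcomes."""

    def __init__(self, scripts):
        # scripts maps name -> list of outcomes, consumed one per call.
        # An Exception instance means "raise it"; no entry means "succeed".
        self._scripts = {name: list(outcomes) for name, outcomes in scripts.items()}
        self.calls = defaultdict(int)
        self.registry = set()

    async def __call__(self, name):
        self.calls[name] += 1
        script = self._scripts.get(name, [])
        outcome = script.pop(0) if script else None
        if isinstance(outcome, Exception):
            raise outcome
        self.registry.add(name)


async def demo():
    # m1 always succeeds; m2 fails on its first call and succeeds on the second.
    upsert = ScriptedUpsert({"m2": [RuntimeError("transient")]})
    first = await asyncio.gather(upsert("m1"), upsert("m2"), return_exceptions=True)
    retry = await asyncio.gather(upsert("m2"), return_exceptions=True)
    return upsert, first, retry
```

Running `asyncio.run(demo())` gives access to `upsert.calls` and `upsert.registry`, which is enough to assert both the retry count and the final registry contents.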