Skip to content

fix(queueserver): recover from partial bootstrap failures on env-open#65

Merged
Anthony Sligar (sligara7) merged 2 commits into
NSLS2:mainfrom
sligara7:fix/qs-bootstrap-partial-failure-recovery
Jun 12, 2026
Merged

fix(queueserver): recover from partial bootstrap failures on env-open#65
Anthony Sligar (sligara7) merged 2 commits into
NSLS2:mainfrom
sligara7:fix/qs-bootstrap-partial-failure-recovery

Conversation

@sligara7

Copy link
Copy Markdown
Collaborator

What

When the configuration-service registry is empty, the first env-open inserts every device the worker knows about. If any one of those inserts failed, the registry was left partially populated AND the next env-open never tried to fix it — the system stayed in a permanent partial state until an operator manually intervened.

Why it happened

sync_devices_on_env_open used a bare asyncio.gather over the per-device upserts. Two problems:

  1. The first failure cancelled the sibling tasks, so the registry only got some of the devices.
  2. On the next env-open, the "is registry empty?" gate saw a non-empty registry, skipped the bootstrap, and the incremental change feed was never going to deliver the missing devices (they were never there to begin with — there was nothing to "change").

Fix

Move the bootstrap into a small helper that:

  1. Runs gather with return_exceptions=True so survivors land instead of being cancelled.
  2. Collects the names that failed.
  3. Retries just those names exactly once.
  4. If anything is still failing after the retry, raises ConfigServiceError naming every still-missing device and its underlying exception.

The retry is bounded (one extra attempt) so a genuinely broken upstream surfaces immediately rather than looping. The loud raise upholds the no-silent-fallback rule.

Symmetric in shape with the env-open lock recovery that just landed in #59.

Tests

In tests/manager/test_config_service.py:

  • Updated test_sync_propagates_bootstrap_failure — the failure now has to repeat on retry to escape, and the wrapping error is ConfigServiceError, not the raw ConfigServiceHTTPError from the underlying POST.
  • New test_sync_bootstrap_retries_partial_failure_and_succeeds — m1 succeeds, m2 fails-then-succeeds; no exception escapes, both devices in the registry, cursor read happens.
  • New test_sync_bootstrap_raises_loudly_with_all_unrecoverable_failures — both devices fail both attempts; raised message names every still-missing device.
  • New test_sync_bootstrap_does_not_retry_when_first_attempt_succeeds — sanity guard against the retry path being entered unnecessarily.

Verified the three new/updated tests fail without the fix and pass with it.

Full tests/manager/test_config_service.py + test_config_service_integration.py: 94 passed.

The first env-open against an empty configuration-service registry
inserts every device the worker knows about. The previous code used a
bare asyncio.gather over those upserts. Two problems:

1. The first failure cancelled the sibling tasks, so the registry was
   left only partially populated.
2. On the next env-open the 'is empty?' gate saw a populated registry,
   skipped bootstrap, and the change feed never delivered the missing
   devices. The system stayed in a permanent partial state until an
   operator manually intervened.

Fix: move bootstrap into a small helper that runs gather with
return_exceptions=True, identifies the names that failed, retries just
those once, and raises ConfigServiceError naming the still-missing
devices and their underlying exceptions if anything is still failing
after the retry. The retry is bounded so a genuinely broken upstream
fails loudly per the no-silent-fallback rule.

Tests in tests/manager/test_config_service.py:
- test_sync_propagates_bootstrap_failure  -- updated for the new
  retry-then-raise shape (failure now needs to repeat on retry to
  escape, and the wrapping error is ConfigServiceError, not the raw
  ConfigServiceHTTPError from the underlying POST).
- test_sync_bootstrap_retries_partial_failure_and_succeeds (new)
- test_sync_bootstrap_raises_loudly_with_all_unrecoverable_failures
  (new; verifies the raised message names every still-missing device).
- test_sync_bootstrap_does_not_retry_when_first_attempt_succeeds
  (new sanity guard).

Full tests/manager/test_config_service.py + test_config_service_integration.py:
94 passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens queueserver env-open synchronization with the configuration-service by preventing “partially bootstrapped” registries when the initial empty-registry bootstrap hits transient per-device failures.

Changes:

  • Introduces _bootstrap_with_retry() to run per-device upserts with asyncio.gather(..., return_exceptions=True), retry failed devices once, and raise a single ConfigServiceError enumerating remaining failures.
  • Updates/adds unit tests to cover: partial-failure recovery, bounded retry behavior, and loud error reporting when failures persist.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
backend/queueserver_service/queueserver_service/manager/config_service.py Adds bootstrap-with-retry helper and routes env-open bootstrap through it to avoid sibling-task cancellation and permanent partial registry states after transient failures.
backend/queueserver_service/tests/manager/test_config_service.py Updates existing bootstrap-failure test and adds new tests for partial-failure retry success, bounded retries, and multi-failure loud errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/queueserver_service/queueserver_service/manager/config_service.py Outdated
Comment thread backend/queueserver_service/queueserver_service/manager/config_service.py Outdated
Comment thread backend/queueserver_service/tests/manager/test_config_service.py Outdated
Three review-bot catches on the bootstrap-with-retry helper, all
genuine:

1. Docstring said 'retry survivors once'; the code retries the names
   that failed, not the ones that succeeded. Reworded to 'retry
   failures once'.

2. The inner _attempt collected any BaseException as a per-device
   failure, including asyncio.CancelledError, KeyboardInterrupt, and
   SystemExit. That risked aggregating cancellation/shutdown into a
   ConfigServiceError and obscuring the real termination reason.
   Narrowed the failure type to Exception and explicitly re-raise the
   non-Exception BaseException subclasses so cancellation and process
   shutdown behave normally.

3. The test docstring claimed ConfigServiceError 'wraps the underlying
   exception', but the implementation only put exception details in
   the message — no chaining via __cause__. Made the claim true:
   chain the first remaining failure as __cause__ on the raised
   ConfigServiceError. A debugger landing on the raise site now has
   the original traceback. Test updated to assert __cause__ is set to
   the underlying ConfigServiceHTTPError.

94 passed (target test files, unchanged count).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sligara7 Anthony Sligar (sligara7) merged commit e841713 into NSLS2:main Jun 12, 2026
11 checks passed
@sligara7 Anthony Sligar (sligara7) deleted the fix/qs-bootstrap-partial-failure-recovery branch June 12, 2026 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants