Feat/gpu doc #1046
Closed
zhongwen666 wants to merge 10 commits into
…restart (#987) (#997)

* test(admin): add SandboxTable reconnect tests with real PG process restart

  Covers the scenario where the postgres process is killed and restarted inside a running container (pg_ctl stop/start), leaving the container port stable but invalidating existing connections. pool_pre_ping=False forces the decorator, not the pool, to handle recovery.

* fix(admin): retry SandboxTable ops once on stale connection after DB restart

  Adds _retry_on_disconnect decorator applied to all six SandboxTable methods. Retries once when DBAPIError.connection_invalidated is True, which SQLAlchemy sets when asyncpg detects "connection is closed", meaning the query never executed and is safe to retry. Addresses stale connections caused by DB process restart or NAT idle timeout dropping the TCP connection.

* test(admin): simulate 3s PG outage to enforce back-off requirement

  A bare single-attempt retry fires immediately after the DB stops and finds it still down. Only a retry strategy with cumulative back-off exceeding the 3-second outage window can bridge the gap. This makes the test RED against the old no-sleep implementation and GREEN once sufficient exponential back-off is in place.

* fix(admin): retry SandboxTable ops with exponential back-off across DB outages

  The retry decorator now spans both failure modes seen during a PG restart:

  1. statement-execution path - an already-checked-out connection goes stale and asyncpg raises sqlalchemy.exc.InterfaceError / OperationalError (DBAPIError subclasses, wrapped by SQLAlchemy's _handle_dbapi_exception).
  2. connect path - the pool tries to dial a fresh connection while PG is still down; asyncpg raises ConnectionError / ConnectionResetError / OSError directly. SQLAlchemy does NOT wrap connect-path failures into DBAPIError, so the previous "except DBAPIError" missed this path entirely: retries fired only on the first stale-connection error and then crashed on the second attempt's connect failure.

  Exception set: (OperationalError, InterfaceError, DisconnectionError, ConnectionError, OSError, asyncio.TimeoutError)

  Excluded on purpose: DatabaseError - it would swallow IntegrityError / ProgrammingError / DataError, all permanent failures that must fast-fail.

  ATTEMPTS=4 with exponential back-off (1s, 2s, 4s) gives a cumulative 7s window, sufficient to bridge typical PG process-restart outages.

(cherry picked from commit f8b456d)

Signed-off-by: Jiachen Zhang <zjc462490@alibaba-inc.com>
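The retry strategy described in the commit message can be sketched roughly as follows. This is not the PR's `_retry_on_disconnect` implementation: `retry_with_backoff`, `backoff_delays`, and `TRANSIENT_ERRORS` are hypothetical names, and the exception tuple is reduced to built-in types so the sketch runs without SQLAlchemy installed.

```python
import asyncio
import functools

# Stand-in for the commit's exception set (assumption). The commit also lists
# sqlalchemy.exc.OperationalError, InterfaceError and DisconnectionError;
# they are omitted here so the sketch has no third-party dependency.
TRANSIENT_ERRORS = (ConnectionError, OSError, asyncio.TimeoutError)


def backoff_delays(attempts=4, base_delay=1.0):
    """Delays slept between attempts: N attempts means N - 1 sleeps."""
    return [base_delay * 2 ** n for n in range(attempts - 1)]


def retry_with_backoff(attempts=4, base_delay=1.0, retry_on=TRANSIENT_ERRORS):
    """Retry an async callable on transient errors with exponential back-off."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the last error
                    await asyncio.sleep(base_delay * 2 ** attempt)

        return wrapper

    return decorator
```

With the defaults, `backoff_delays()` yields the 1s, 2s, 4s schedule from the commit message. Exceptions outside `retry_on` (for example an integrity error) propagate on the first attempt, which matches the stated intent that permanent failures fast-fail.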
Add `_base: <path>` resolution in RockConfig.from_env() with deep merge support for dicts and identity-keyed lists. Multi-region YAML configs can now factor out a single base file; previously the `_base` key was silently dropped by the kwargs whitelist, leading to dataclass-default fallbacks (e.g. redis port=0) at runtime. fixes #1004
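One way such a `_base` merge could behave is sketched below. This is an illustration under assumptions, not the code in `RockConfig.from_env()`: the function names are hypothetical, the identity key is assumed to be `name`, and `load` is a caller-supplied function that maps a path to an already-parsed dict (relative-path handling and YAML parsing are left out).

```python
import copy


def _is_keyed_list(value, identity_key):
    """True for a non-empty list of dicts that all carry the identity key."""
    return (
        isinstance(value, list)
        and bool(value)
        and all(isinstance(item, dict) and identity_key in item for item in value)
    )


def deep_merge(base, override, identity_key="name"):
    """Return base overlaid with override, without mutating either input."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = copy.deepcopy(base)
        for key, value in override.items():
            if key in base:
                merged[key] = deep_merge(base[key], value, identity_key)
            else:
                merged[key] = copy.deepcopy(value)
        return merged
    if _is_keyed_list(base, identity_key) and _is_keyed_list(override, identity_key):
        merged = [copy.deepcopy(item) for item in base]
        index = {item[identity_key]: i for i, item in enumerate(merged)}
        for item in override:
            pos = index.get(item[identity_key])
            if pos is None:
                # New identity: append and remember its position.
                index[item[identity_key]] = len(merged)
                merged.append(copy.deepcopy(item))
            else:
                # Same identity: merge field by field.
                merged[pos] = deep_merge(merged[pos], item, identity_key)
        return merged
    # Scalars, unkeyed lists, and mismatched types: override wins.
    return copy.deepcopy(override)


def resolve_base(config, load):
    """Merge config over the file named by its `_base` key, if present."""
    if "_base" not in config:
        return copy.deepcopy(config)
    child = {k: v for k, v in config.items() if k != "_base"}
    parent = resolve_base(load(config["_base"]), load)
    return deep_merge(parent, child)
```

In this sketch a region file that sets only `redis.host` inherits `redis.port` from the base file, so the value no longer falls back to a dataclass default. The `_base` key itself is consumed during resolution and does not appear in the merged result.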
close #1044