Feat/gpu doc #1046

Closed
zhongwen666 wants to merge 10 commits into master from feat/gpu_doc

@zhongwen666

Collaborator

close #1044

zhangjaycee and others added 10 commits May 22, 2026 18:06
…restart (#987) (#997)

* test(admin): add SandboxTable reconnect tests with real PG process restart

Covers the scenario where the postgres process is killed and restarted
inside a running container (pg_ctl stop/start), leaving the container
port stable but invalidating existing connections. pool_pre_ping=False
forces the decorator — not the pool — to handle recovery.

* fix(admin): retry SandboxTable ops once on stale connection after DB restart

Adds _retry_on_disconnect decorator applied to all six SandboxTable methods.
Retries once when DBAPIError.connection_invalidated is True, which SQLAlchemy
sets when asyncpg detects "connection is closed" — meaning the query never
executed and is safe to retry. Addresses stale connections caused by DB
process restart or NAT idle timeout dropping the TCP connection.
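
A minimal sketch of the retry-once idea described above, not the PR's actual `_retry_on_disconnect`. `StaleConnectionError` is a hypothetical stand-in for `sqlalchemy.exc.DBAPIError` so the example runs without a database; only the `connection_invalidated` attribute is taken from the commit message.

```python
import asyncio
import functools


class StaleConnectionError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.DBAPIError (illustration only)."""

    def __init__(self, msg, connection_invalidated=False):
        super().__init__(msg)
        self.connection_invalidated = connection_invalidated


def retry_once_on_invalidated(fn, exc_type=StaleConnectionError):
    """Retry an async call exactly once if the connection was invalidated."""

    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        try:
            return await fn(*args, **kwargs)
        except exc_type as exc:
            # Only an invalidated connection means the query never executed,
            # so anything else is re-raised untouched.
            if not getattr(exc, "connection_invalidated", False):
                raise
            return await fn(*args, **kwargs)

    return wrapper
```

Because the second attempt is not wrapped, a failure that persists past the first retry still propagates to the caller.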

* test(admin): simulate 3s PG outage to enforce back-off requirement

A bare single-attempt retry fires immediately after the DB stops and
finds it still down.  Only a retry strategy with cumulative back-off
exceeding the 3-second outage window can bridge the gap.

This makes the test RED against the old no-sleep implementation and
GREEN once sufficient exponential back-off is in place.

* fix(admin): retry SandboxTable ops with exponential back-off across DB outages

The retry decorator now spans both failure modes seen during a PG restart:

1. statement-execution path - an already-checked-out connection goes stale
   and asyncpg raises sqlalchemy.exc.InterfaceError / OperationalError
   (DBAPIError subclasses, wrapped by SQLAlchemy's _handle_dbapi_exception).
2. connect path - the pool tries to dial a fresh connection while PG is
   still down; asyncpg raises ConnectionError / ConnectionResetError /
   OSError directly. SQLAlchemy does NOT wrap connect-path failures into
   DBAPIError, so the previous "except DBAPIError" missed this path
   entirely - retries fired only on the first stale-connection error and
   then crashed on the second attempt's connect failure.

Exception set:
  (OperationalError, InterfaceError, DisconnectionError,
   ConnectionError, OSError, asyncio.TimeoutError)

Excluded on purpose: DatabaseError - it would swallow IntegrityError /
ProgrammingError / DataError, all permanent failures that must fast-fail.

ATTEMPTS=4 with exponential back-off (1s, 2s, 4s) gives a cumulative 7s
window, sufficient to bridge typical PG process-restart outages.
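
The back-off behaviour described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the code from the commit: the names `retry_on_disconnect`, `RETRYABLE`, `BASE_DELAY`, and the injectable `sleep` parameter are hypothetical, and the SQLAlchemy exceptions are imported only if the library is installed.

```python
import asyncio
import functools

try:  # SQLAlchemy is optional for this sketch
    from sqlalchemy.exc import DisconnectionError, InterfaceError, OperationalError
    _SA_ERRORS = (OperationalError, InterfaceError, DisconnectionError)
except ImportError:
    _SA_ERRORS = ()

# Transient failures only. DatabaseError is deliberately absent so that
# IntegrityError / ProgrammingError / DataError fail fast.
RETRYABLE = _SA_ERRORS + (ConnectionError, OSError, asyncio.TimeoutError)

ATTEMPTS = 4
BASE_DELAY = 1.0  # seconds; delays are 1s, 2s, 4s (7s cumulative)


def retry_on_disconnect(attempts=ATTEMPTS, base_delay=BASE_DELAY, sleep=asyncio.sleep):
    """Retry an async callable on transient connection errors with exponential back-off."""

    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return await fn(*args, **kwargs)
                except RETRYABLE:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the last error
                    await sleep(base_delay * 2 ** attempt)

        return wrapper

    return decorator
```

Catching in one place covers both the statement-execution path and the connect path, since each new attempt may fail on either. Passing `sleep` as a parameter is a design choice of this sketch that lets tests substitute a fake and avoid real waiting.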

(cherry picked from commit f8b456d)

Signed-off-by: Jiachen Zhang <zjc462490@alibaba-inc.com>
Add `_base: <path>` resolution in RockConfig.from_env() with deep merge
support for dicts and identity-keyed lists. Multi-region YAML configs can
now factor out a single base file; previously the `_base` key was silently
dropped by the kwargs whitelist, leading to dataclass-default fallbacks
(e.g. redis port=0) at runtime.
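
To illustrate what a deep merge over dicts and identity-keyed lists might look like, here is a small sketch that works on already-parsed config dicts. It is not the `RockConfig.from_env()` implementation: `deep_merge` is a hypothetical helper, and the identity key `"name"` is an assumption, since the commit message does not say which key identifies list items.

```python
from copy import deepcopy

ID_KEY = "name"  # assumption: the key that identifies items in a keyed list


def _is_keyed_list(value):
    """True for a non-empty list whose items are all dicts carrying ID_KEY."""
    return (
        isinstance(value, list)
        and bool(value)
        and all(isinstance(item, dict) and ID_KEY in item for item in value)
    )


def deep_merge(base, override):
    """Return base overlaid with override, without mutating either input."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = deepcopy(base)
        for key, value in override.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else deepcopy(value)
        return merged
    if _is_keyed_list(base) and _is_keyed_list(override):
        # Match items by identity key; base order is kept, new items are appended.
        by_id = {item[ID_KEY]: deepcopy(item) for item in base}
        for item in override:
            key = item[ID_KEY]
            by_id[key] = deep_merge(by_id[key], item) if key in by_id else deepcopy(item)
        return list(by_id.values())
    # Scalars, plain lists, and mismatched types: the override wins.
    return deepcopy(override)
```

With this shape, a region file that sets only `redis.host` would inherit `redis.port` from the base file instead of falling back to a dataclass default.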

  fixes #1004
@zhongwen666 zhongwen666 closed this Jun 1, 2026
Development

Successfully merging this pull request may close these issues.

[Feature] Expose num_gpus and accelerator_type in Sandbox SDK config

4 participants