You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(handoff): integrate beyond-handoff for zero-downtime restarts
Mirrors kv's handoff integration. A supervisor holds the listener FD
across binary swaps; an `Incumbent::serve` control thread drains
in-flight requests and persists the fjall index before signalling the
successor to take over. The kernel SYN queue absorbs new connections
during the drain window so no client sees a connection refused.
Drainable bridge (ObjectsHandoff):
- drain() suspends accepts via accept_closed and polls
http_connections_active until 0 or deadline. SyncGroup already syncs
each upload per-request, so no fan-out is needed.
- seal() calls Index::persist(SyncAll) — defensive durability flush
(fjall is durable per-write).
- resume_after_abort() clears accept_closed and bumps rolled_back.
PausableListener wraps tokio::net::TcpListener and implements
axum::serve::Listener so accept() suspends while accept_closed is set.
The TLS accept loop has the same gate inlined.
serve() flow: role detection → DataDirLock → inherited-or-fresh
listener → spawn_blocking(Incumbent::serve) → axum::serve with unified
signal-or-commit shutdown. std::process::exit(0) at the end so SIGTERM
actually terminates the process (the blocking thread cannot be
cancelled by tokio runtime drop). otel_guard is dropped explicitly
before exit so traces flush. Best-effort mkdir on the handoff socket's
parent dir so fresh deploys don't hit a confusing EACCES.
Tests (cargo test --test handoff_smoke handoff_e2e handoff_rigorous):
- handoff_smoke: protocol drive, SIGTERM exit-time guard,
clean-shutdown latency guard.
- handoff_e2e: data_survives_handoff, back_to_back_handoffs.
- handoff_rigorous: 14 scenarios — durability under uploader load,
flock invariant, stale-lock recovery, abort-pre-Ready, seal-failure
retry, supervisor crash mid-handoff, concurrent calls, metrics on
abort, supervisor restart, 10-handoff soak under load, multipart
across handoff, 4 MB streaming bodies through handoff, mTLS handoff,
50-object crash-restart durability.
CI: new mise task `test:handoff:rs` wired into ci.yml.
Side fixes:
- crates/storage/src/write.rs: pre-existing build error in the Linux
IfNoneMatch branch where `xattr::set_object` returned StorageError
but the spawn_blocking closure needed io::Error.
- sdk/ts/__tests__/{global-setup,tls.test}.ts: set
OBJECTS_HANDOFF_SOCKET_PATH per test so they don't try to bind
/run/beyond/objects/control.sock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|`OTLP_SAMPLE_RATE`|`0.1`| Fraction of traces sampled (0.0 = never, 1.0 = always); only effective when OTLP_ENABLED |
334
+
|`OBJECTS_URL`| (none) | Public base URL for object URLs returned by SDK `client.url(key)`|
335
+
|`SYNC_LINGER_MS`|`5`| fdatasync batching window; 0 = inline sync per upload (see Sync Linger Batching) |
336
+
|`DRAIN_TIMEOUT_SECS`|`30`| Seconds to wait for in-flight requests to drain after shutdown signal; 0 = wait forever |
337
+
|`GC_TEMP_TTL_SECS`|`3600`| Min age for `.tmp/` orphans to be eligible for startup GC |
338
+
|`GC_MULTIPART_TTL_SECS`|`86400`| Min age for incomplete multipart uploads to be eligible for startup GC |
339
+
|`OBJECTS_HANDOFF_SOCKET_PATH`|`/run/beyond/objects/control.sock`| Unix-domain socket where the handoff supervisor connects to drive zero-downtime swaps |
@@ -350,6 +351,65 @@ Object write routes (PUT, multipart part upload) remove both the body limit and
350
351
| Disk full | write_object fails during streaming; temp file removed | 500 returned; no partial state visible |
351
352
| xattr not supported | Startup will fail on first write attempt | GlideFS always supports xattrs; local ext4/apfs also work |
352
353
354
+
## Zero-downtime Restarts (Handoff)
355
+
356
+
`beyond-objects` integrates the in-house `beyond-handoff` library for binary swaps without dropping the kernel SYN queue. The integration mirrors `beyond-kv`'s ([sibling cohesion](../kv/ARCHITECTURE.md)), adapted to objects' single-tokio-runtime, single-listener shape.
357
+
358
+
### Roles and process layout
359
+
360
+
A handoff involves three principals:
361
+
362
+
-**Supervisor (S)** — long-running parent; binds the listener once, holds its FD across the swap, spawns successors via `fork+exec`.
363
+
-**Incumbent (O)** — the currently-serving process. Holds the data-dir flock; runs an `Incumbent::serve` control thread (on tokio's blocking pool) that talks to S over a Unix-domain control socket.
364
+
-**Successor (N)** — spawned by S during a handoff with `HANDOFF_ROLE=successor` and the inherited listener FD in slot 3. Compile-time-ordered state machine (`Successor → HandshookSuccessor → BegunSuccessor`) gates startup on the protocol.
365
+
366
+
`detect_role()` at the top of `serve()` decides which path runs. ColdStart consumes any `LISTEN_FDS` env vars (the supervisor's first spawn); Successor handshakes, then blocks on `wait_for_begin()` until S says O has finished `seal`.
367
+
368
+
### Lifecycle on each handoff
369
+
370
+
1. S accepts a swap request; spawns N (`fork+exec` with FD slots filled).
371
+
2. N starts, calls `detect_role()` → `Successor`, handshakes with S over its control-socket FD, waits for `Begin`.
- sets `accept_closed = true` (shared with the [`PausableListener`](crates/server/src/handoff.rs) and the TLS accept loop)
375
+
- polls `http_connections_active` until 0 or the deadline
376
+
- replies `Drained`
377
+
- The kernel SYN backlog absorbs incoming connections in this window — they are not dropped, just queued.
378
+
5. S sends `SealRequest`. O calls `Drainable::seal()`, which calls `Index::persist(SyncAll)` (defensive — fjall is durable per-write). The library then releases the data-dir flock and replies `SealComplete`.
379
+
6. S sends `Begin` to N. N acquires the flock (now free), opens its Storage + Index, reconciles, and finally calls `announce_and_bind(snapshot, socket_path, lock)` to send `Ready` and bind the control socket atomically.
380
+
7. S sends `Commit` to O. O's blocking task signals `commit_tx`. The unified shutdown future in `serve()` resolves, axum drains its remaining tasks, the process exits.
381
+
8. The successor's `axum::serve(PausableListener, app).accept()` now drains the SYN backlog. From the kernel's perspective the listener never closed.
382
+
383
+
### Abort path
384
+
385
+
If anything between Begin and Commit fails — N exits before `Ready`, the seal returns an error, or S itself disconnects — the library invokes `Drainable::resume_after_abort()` on O: it clears `accept_closed`, re-acquires the flock (if it was released), and continues serving as the authoritative incumbent. No state was transformed by `seal` that needs rolling back.
-`OBJECTS_TEST_PANIC_BEFORE_READY=1` — successor exits with code 42 after `wait_for_begin` and before `announce_and_bind`. Exercises the supervisor's abort + incumbent's `resume_after_abort` paths against a real process.
401
+
-`OBJECTS_TEST_FAIL_ONCE_FILE=<path>` — on `seal()`, if the named file exists, unlink it and return `Error::Protocol("seal failed: test hook")`. Validates the `SealFailed` recovery path.
402
+
403
+
Both are consumed via `std::env::var` in production code (see `lib.rs:serve()` and `handoff.rs:seal()`); production never sets them.
404
+
405
+
### Why It Behaves This Way
406
+
407
+
**Why the SYN-queue pause instead of closing the listener.** Closing the listener mid-handoff would RST any waiting connect()s. By suspending `accept()` (the `PausableListener::accept` future just sleeps while `accept_closed` is set), the kernel's listen backlog absorbs incoming connections. When the successor's `axum::serve` starts calling `accept()` on the inherited FD, those queued connections drain into the new process with zero client-visible failures.
408
+
409
+
**Why `Index::persist()` in `seal()` even though fjall is durable per-write.** Defense in depth, and a single explicit fsync point makes future durability tunings opt-in rather than opt-out. The cost is one fdatasync on a typically-small journal.
410
+
411
+
**Why `spawn_blocking` and not `std::thread::spawn` for `Incumbent::serve`.** The control thread blocks on `recv` from the Unix socket — exactly the workload tokio's blocking pool is sized for. Putting it there keeps the runtime's worker threads free and means the shutdown story is uniform (the runtime tracks blocking tasks).
0 commit comments