Skip to content

Commit 3457aed

Browse files
jaredLundeclaude
andcommitted
feat(handoff): integrate beyond-handoff for zero-downtime restarts
Mirrors kv's handoff integration. A supervisor holds the listener FD across binary swaps; an `Incumbent::serve` control thread drains in-flight requests and persists the fjall index before signalling the successor to take over. The kernel SYN queue absorbs new connections during the drain window so no client sees a connection refused. Drainable bridge (ObjectsHandoff): - drain() suspends accepts via accept_closed and polls http_connections_active until 0 or deadline. SyncGroup already syncs each upload per-request, so no fan-out is needed. - seal() calls Index::persist(SyncAll) — defensive durability flush (fjall is durable per-write). - resume_after_abort() clears accept_closed and bumps rolled_back. PausableListener wraps tokio::net::TcpListener and implements axum::serve::Listener so accept() suspends while accept_closed is set. The TLS accept loop has the same gate inlined. serve() flow: role detection → DataDirLock → inherited-or-fresh listener → spawn_blocking(Incumbent::serve) → axum::serve with unified signal-or-commit shutdown. std::process::exit(0) at the end so SIGTERM actually terminates the process (the blocking thread cannot be cancelled by tokio runtime drop). otel_guard is dropped explicitly before exit so traces flush. Best-effort mkdir on the handoff socket's parent dir so fresh deploys don't hit a confusing EACCES. Tests (cargo test --test handoff_smoke handoff_e2e handoff_rigorous): - handoff_smoke: protocol drive, SIGTERM exit-time guard, clean-shutdown latency guard. - handoff_e2e: data_survives_handoff, back_to_back_handoffs. - handoff_rigorous: 14 scenarios — durability under uploader load, flock invariant, stale-lock recovery, abort-pre-Ready, seal-failure retry, supervisor crash mid-handoff, concurrent calls, metrics on abort, supervisor restart, 10-handoff soak under load, multipart across handoff, 4 MB streaming bodies through handoff, mTLS handoff, 50-object crash-restart durability. CI: new mise task `test:handoff:rs` wired into ci.yml. Side fixes: - crates/storage/src/write.rs: pre-existing build error in the Linux IfNoneMatch branch where `xattr::set_object` returned StorageError but the spawn_blocking closure needed io::Error. - sdk/ts/__tests__/{global-setup,tls.test}.ts: set OBJECTS_HANDOFF_SOCKET_PATH per test so they don't try to bind /run/beyond/objects/control.sock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bb3ce9f commit 3457aed

24 files changed

Lines changed: 3946 additions & 1010 deletions

File tree

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,8 @@ jobs:
2121
run: mise run test:unit:rs
2222
- name: test:integration:rs
2323
run: mise run test:integration:rs
24+
- name: test:handoff:rs
25+
run: mise run test:handoff:rs
2426
- name: check:ts
2527
run: mise run check:ts
2628
- name: build:rs:release

ARCHITECTURE.md

Lines changed: 76 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -321,22 +321,23 @@ Object write routes (PUT, multipart part upload) remove both the body limit and
321321

322322
## Configuration
323323

324-
| Variable | Default | What It Controls at Runtime |
325-
| ----------------------- | ----------------------- | ---------------------------------------------------------------------------------------- |
326-
| `OBJECTS_ROOT_TOKEN` | (required) | Root auth token; also the HMAC key for derived tokens |
327-
| `OBJECTS_DATA_DIR` | `/data` | Root directory for all bucket subdirectories and `.tmp/` |
328-
| `OBJECTS_INDEX_DIR` | `/data/.index` | Where fjall writes its LSM-tree files |
329-
| `ADDRESS` | `0.0.0.0:9000` | Public bind address for REST, S3, `/metrics`, and health probes |
330-
| `LOG_LEVEL` | `info` | `tracing` filter directive |
331-
| `OTLP_ENABLED` | `false` | Whether to export traces to `OTLP_ENDPOINT` |
332-
| `OTLP_ENDPOINT` | `http://localhost:4317` | OTLP collector gRPC address |
333-
| `OTLP_SAMPLE_RATE` | `0.1` | Fraction of traces sampled (0.0 = never, 1.0 = always); only effective when OTLP_ENABLED |
334-
| `OBJECTS_URL` | (none) | Public base URL for object URLs returned by SDK `client.url(key)` |
335-
| `SYNC_LINGER_MS` | `5` | fdatasync batching window; 0 = inline sync per upload (see Sync Linger Batching) |
336-
| `DRAIN_TIMEOUT_SECS` | `30` | Seconds to wait for in-flight requests to drain after shutdown signal; 0 = wait forever |
337-
| `GC_TEMP_TTL_SECS` | `3600` | Min age for `.tmp/` orphans to be eligible for startup GC |
338-
| `GC_MULTIPART_TTL_SECS` | `86400` | Min age for incomplete multipart uploads to be eligible for startup GC |
339-
| `ENVIRONMENT` | (none) | `development` enables pretty log output |
324+
| Variable | Default | What It Controls at Runtime |
325+
| ----------------------------- | ---------------------------------- | ---------------------------------------------------------------------------------------- |
326+
| `OBJECTS_ROOT_TOKEN` | (required) | Root auth token; also the HMAC key for derived tokens |
327+
| `OBJECTS_DATA_DIR` | `/data` | Root directory for all bucket subdirectories and `.tmp/` |
328+
| `OBJECTS_INDEX_DIR` | `/data/.index` | Where fjall writes its LSM-tree files |
329+
| `ADDRESS` | `0.0.0.0:9000` | Public bind address for REST, S3, `/metrics`, and health probes |
330+
| `LOG_LEVEL` | `info` | `tracing` filter directive |
331+
| `OTLP_ENABLED` | `false` | Whether to export traces to `OTLP_ENDPOINT` |
332+
| `OTLP_ENDPOINT` | `http://localhost:4317` | OTLP collector gRPC address |
333+
| `OTLP_SAMPLE_RATE` | `0.1` | Fraction of traces sampled (0.0 = never, 1.0 = always); only effective when OTLP_ENABLED |
334+
| `OBJECTS_URL` | (none) | Public base URL for object URLs returned by SDK `client.url(key)` |
335+
| `SYNC_LINGER_MS` | `5` | fdatasync batching window; 0 = inline sync per upload (see Sync Linger Batching) |
336+
| `DRAIN_TIMEOUT_SECS` | `30` | Seconds to wait for in-flight requests to drain after shutdown signal; 0 = wait forever |
337+
| `GC_TEMP_TTL_SECS` | `3600` | Min age for `.tmp/` orphans to be eligible for startup GC |
338+
| `GC_MULTIPART_TTL_SECS` | `86400` | Min age for incomplete multipart uploads to be eligible for startup GC |
339+
| `OBJECTS_HANDOFF_SOCKET_PATH` | `/run/beyond/objects/control.sock` | Unix-domain socket where the handoff supervisor connects to drive zero-downtime swaps |
340+
| `ENVIRONMENT` | (none) | `development` enables pretty log output |
340341

341342
## Failure Modes
342343

@@ -350,6 +351,65 @@ Object write routes (PUT, multipart part upload) remove both the body limit and
350351
| Disk full | write_object fails during streaming; temp file removed | 500 returned; no partial state visible |
351352
| xattr not supported | Startup will fail on first write attempt | GlideFS always supports xattrs; local ext4/apfs also work |
352353

354+
## Zero-downtime Restarts (Handoff)
355+
356+
`beyond-objects` integrates the in-house `beyond-handoff` library for binary swaps without dropping the kernel SYN queue. The integration mirrors `beyond-kv`'s ([sibling cohesion](../kv/ARCHITECTURE.md)), adapted to objects' single-tokio-runtime, single-listener shape.
357+
358+
### Roles and process layout
359+
360+
A handoff involves three principals:
361+
362+
- **Supervisor (S)** — long-running parent; binds the listener once, holds its FD across the swap, spawns successors via `fork+exec`.
363+
- **Incumbent (O)** — the currently-serving process. Holds the data-dir flock; runs an `Incumbent::serve` control thread (on tokio's blocking pool) that talks to S over a Unix-domain control socket.
364+
- **Successor (N)** — spawned by S during a handoff with `HANDOFF_ROLE=successor` and the inherited listener FD in slot 3. Compile-time-ordered state machine (`Successor → HandshookSuccessor → BegunSuccessor`) gates startup on the protocol.
365+
366+
`detect_role()` at the top of `serve()` decides which path runs. ColdStart consumes any `LISTEN_FDS` env vars (the supervisor's first spawn); Successor handshakes, then blocks on `wait_for_begin()` until S says O has finished `seal`.
367+
368+
### Lifecycle on each handoff
369+
370+
1. S accepts a swap request; spawns N (`fork+exec` with FD slots filled).
371+
2. N starts, calls `detect_role()``Successor`, handshakes with S over its control-socket FD, waits for `Begin`.
372+
3. S sends `PrepareHandoff` to O.
373+
4. O's `Incumbent::serve` loop calls `Drainable::drain(deadline)`:
374+
- sets `accept_closed = true` (shared with the [`PausableListener`](crates/server/src/handoff.rs) and the TLS accept loop)
375+
- polls `http_connections_active` until 0 or the deadline
376+
- replies `Drained`
377+
- The kernel SYN backlog absorbs incoming connections in this window — they are not dropped, just queued.
378+
5. S sends `SealRequest`. O calls `Drainable::seal()`, which calls `Index::persist(SyncAll)` (defensive — fjall is durable per-write). The library then releases the data-dir flock and replies `SealComplete`.
379+
6. S sends `Begin` to N. N acquires the flock (now free), opens its Storage + Index, reconciles, and finally calls `announce_and_bind(snapshot, socket_path, lock)` to send `Ready` and bind the control socket atomically.
380+
7. S sends `Commit` to O. O's blocking task signals `commit_tx`. The unified shutdown future in `serve()` resolves, axum drains its remaining tasks, the process exits.
381+
8. The successor's `axum::serve(PausableListener, app).accept()` now drains the SYN backlog. From the kernel's perspective the listener never closed.
382+
383+
### Abort path
384+
385+
If anything between Begin and Commit fails — N exits before `Ready`, the seal returns an error, or S itself disconnects — the library invokes `Drainable::resume_after_abort()` on O: it clears `accept_closed`, re-acquires the flock (if it was released), and continues serving as the authoritative incumbent. No state was transformed by `seal` that needs rolling back.
386+
387+
### Where the code lives
388+
389+
| Concern | File |
390+
| -------------------------------------------------- | ------------------------------------------------------------- |
391+
| `Drainable` impl + `PausableListener` | `crates/server/src/handoff.rs` |
392+
| Role detection, control-socket bind, serve wire-up | `crates/server/src/lib.rs::serve()` |
393+
| `accept_closed` pause check in TLS path | `crates/server/src/lib.rs::serve_tls()` |
394+
| Defensive durability flush in `seal()` | `crates/index/src/lib.rs::Index::persist()` |
395+
| Metrics | `crates/server/src/metrics.rs` (`handoff_*` family) |
396+
| Config | `crates/server/src/config.rs` (`OBJECTS_HANDOFF_SOCKET_PATH`) |
397+
398+
### Test-only env hooks
399+
400+
- `OBJECTS_TEST_PANIC_BEFORE_READY=1` — successor exits with code 42 after `wait_for_begin` and before `announce_and_bind`. Exercises the supervisor's abort + incumbent's `resume_after_abort` paths against a real process.
401+
- `OBJECTS_TEST_FAIL_ONCE_FILE=<path>` — on `seal()`, if the named file exists, unlink it and return `Error::Protocol("seal failed: test hook")`. Validates the `SealFailed` recovery path.
402+
403+
Both are consumed via `std::env::var` in production code (see `lib.rs:serve()` and `handoff.rs:seal()`); production never sets them.
404+
405+
### Why It Behaves This Way
406+
407+
**Why the SYN-queue pause instead of closing the listener.** Closing the listener mid-handoff would RST any waiting connect()s. By suspending `accept()` (the `PausableListener::accept` future just sleeps while `accept_closed` is set), the kernel's listen backlog absorbs incoming connections. When the successor's `axum::serve` starts calling `accept()` on the inherited FD, those queued connections drain into the new process with zero client-visible failures.
408+
409+
**Why `Index::persist()` in `seal()` even though fjall is durable per-write.** Defense in depth, and a single explicit fsync point makes future durability tunings opt-in rather than opt-out. The cost is one fdatasync on a typically-small journal.
410+
411+
**Why `spawn_blocking` and not `std::thread::spawn` for `Incumbent::serve`.** The control thread blocks on `recv` from the Unix socket — exactly the workload tokio's blocking pool is sized for. Putting it there keeps the runtime's worker threads free and means the shutdown story is uniform (the runtime tracks blocking tasks).
412+
353413
## Why It Behaves This Way
354414

355415
### Why the filesystem is the database

0 commit comments

Comments
 (0)