Branch ops are O(num-tables) serialized S3 round-trips: fork ~5s / delete ~14s on a near-empty graph (RFC-013 step 6?) #310

Writeup on a branch-op latency wall we hit running v0.7.2 in cluster mode, in case it helps prioritize what looks like RFC-013 step 6.

**Short version:** branch ops (create/delete/merge) are dominated by serialized S3 round-trips whose count scales with the number of tables (node/edge types), not with data volume or commit history. There's no concurrency on the manifest/commit-graph paths, and each Lance metadata open builds a fresh (unpooled) client, so every open also pays a cold TLS handshake. On an object store with real RTT this adds up fast.

**Environment:** v0.7.2, cluster mode, storage on Cloudflare R2, server same-region (~17ms warm RTT). Graph: 112 tables (20 node + 92 edge types), 4 commits, ~58 rows, so effectively empty.

**Measured** (instrumented on the live server plus node-side packet capture):

| op | wall | S3 requests | TCP conns | req/table | CPU |
|---|---|---|---|---|---|
| fork (`POST /branches`) | ~5.5s | 448 | 115 | 4.0 | 0.35s (~94% idle) |
| delete (`DELETE /branches/{b}`) | ~13.6s | 1281 | 288 | 11.4 | 0.7s (~94% idle) |

About 94% of wall is I/O wait. The same ops, same 112 tables, against local disk (`file://`) run in 45ms / 55ms; against in-cluster MinIO, 0.25s / 0.46s. So wall-clock is roughly round-trip-count times per-request-latency, and the round-trip count (hundreds, on an empty graph) is the amplifier.
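As a rough cross-check of that "count times latency" reading, the arithmetic can be redone from the figures above. This is a sketch in Python (picked for brevity, not the project's language); `OPS`, `per_request_ms` and `summarize` are names made up for illustration. It assumes the MinIO run issues the same number of requests as the R2 run, which was not counted separately there.

```python
# Figures copied from the table and paragraph above; nothing here is newly measured.
TABLES = 112

OPS = {
    # op: (S3 requests on R2, wall seconds on R2, wall seconds on in-cluster MinIO)
    "fork": (448, 5.5, 0.25),
    "delete": (1281, 13.6, 0.46),
}


def per_request_ms(wall_s, requests):
    """Average wall-clock cost per request if the requests ran back to back."""
    return wall_s / requests * 1000.0


def summarize(ops=OPS, tables=TABLES):
    rows = {}
    for op, (requests, r2_s, minio_s) in ops.items():
        rows[op] = {
            "req_per_table": round(requests / tables, 1),
            "r2_ms": round(per_request_ms(r2_s, requests), 1),
            # Assumption: same request count on MinIO as on R2.
            "minio_ms": round(per_request_ms(minio_s, requests), 2),
        }
    return rows


if __name__ == "__main__":
    for op, row in summarize().items():
        print(op, row)
```

This gives about 12.3ms (fork) and 10.6ms (delete) per request on R2, against roughly 0.56ms and 0.36ms on MinIO. These are averages only: they fold handshakes and any overlap into one number, and both R2 values come out somewhat below the ~17ms warm RTT, so the simple product is an approximation rather than an exact fit.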

**Code-level (read at tag v0.7.2):**
1. No concurrency on the control plane. The manifest/commit-graph/branch paths are `for … .await` chains; the only `buffer_unordered`/`join_all` in non-test `src/` are in data-fragment staging and `optimize` (e.g. merge's per-table loops at `exec/merge.rs:1549` and `:1747`).
2. No shared `Session` or pooled client for manifest and commit-graph opens (`instrumentation.rs:226` `open_dataset_tracked`, vs the read-path `open_table_dataset` at `:252` which does get the shared Session). That means a cold TLS handshake per open, matching the 115/288 fresh connections we saw.
3. Amplifiers: create re-validates the schema contract and opens the commit graph from scratch twice; delete repeats `Dataset::open` plus `refresh()`; publish scans `__manifest` 3 to 4 times (`db/manifest/publisher.rs:106-117`); merge re-opens coordinators about 6 times in the prelude (`exec/merge.rs:1409-1456`).
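To make item 1 concrete, here is a toy model of the two shapes: a serialized `for … .await` chain versus a bounded fan-out in the spirit of `buffer_unordered`. It is Python `asyncio` against a fake store, not the project's code; `FakeStore`, `open_serial`, `open_bounded` and `run_demo` are hypothetical names, and the per-table opens are assumed to be independent of each other, which would need checking against the real manifest/commit-graph paths.

```python
import asyncio


class FakeStore:
    """Stand-in for an object store: each open costs one simulated round-trip."""

    def __init__(self, rtt_s):
        self.rtt_s = rtt_s
        self.in_flight = 0
        self.peak_in_flight = 0
        self.requests = 0

    async def open_table(self, name):
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        self.requests += 1
        try:
            await asyncio.sleep(self.rtt_s)  # simulated network round-trip
        finally:
            self.in_flight -= 1
        return name


async def open_serial(store, tables):
    # Shape of a `for ... .await` chain: one round-trip at a time.
    return [await store.open_table(t) for t in tables]


async def open_bounded(store, tables, limit):
    # Rough analogue of `buffer_unordered(limit)`: at most `limit` opens in flight.
    # Unlike buffer_unordered, gather returns results in input order.
    gate = asyncio.Semaphore(limit)

    async def one(table):
        async with gate:
            return await store.open_table(table)

    return await asyncio.gather(*(one(t) for t in tables))


def run_demo(n_tables=112, rtt_s=0.001, limit=16):
    tables = [f"table_{i}" for i in range(n_tables)]
    serial, bounded = FakeStore(rtt_s), FakeStore(rtt_s)
    asyncio.run(open_serial(serial, tables))
    asyncio.run(open_bounded(bounded, tables, limit))
    return serial, bounded


if __name__ == "__main__":
    serial, bounded = run_demo()
    print("serial peak in flight:", serial.peak_in_flight)
    print("bounded peak in flight:", bounded.peak_in_flight)
```

Both variants issue the same number of requests; only the number in flight differs. That is the point of the sketch: concurrency would cut wall-clock without touching the request count, whereas the amplifiers in item 3 are about the count itself.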

**What we did as a stopgap:** we migrated the graph's storage from R2 to an in-cluster MinIO (same node as the server, sub-ms RTT), keeping R2 as an off-site export backup. The idea was to attack per-request latency rather than round-trip count, since we can't easily change the latter from outside. It works because the same hundreds of round-trips finish quickly when each takes under a millisecond instead of ~17ms. After cutover, measured on the production server: fork went from 4.7s to 0.14s, delete from 10.3s to 0.26s.

One thing worth flagging for others on object stores: the cluster ledger's CAS needs `If-Match` on PutObject. MinIO honors it (we verified with a raw conditional-PUT probe and a full init/fork/delete cycle), whereas Hetzner Object Storage (Ceph RGW) rejects it.

The catch is that this only hides the root cause: the round-trip count is unchanged, so it will regrow as commit history accumulates, and it forces storage to be co-located.
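For anyone unfamiliar with why `If-Match` matters for a CAS, here is a toy in-memory model of the semantics. It is not the cluster ledger's code and not any store's client API; `ToyBucket` and `PreconditionFailed` are invented for illustration, and the ETag is simply a content hash.

```python
import hashlib


class PreconditionFailed(Exception):
    """Models an HTTP 412 response to a conditional PUT."""


class ToyBucket:
    """In-memory stand-in for a bucket that honors If-Match on PUT."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def put(self, key, body, if_match=None):
        current = self._objects.get(key)
        if if_match is not None:
            # Conditional write: succeed only if the stored ETag is the one
            # the caller last read.
            if current is None or current[0] != if_match:
                raise PreconditionFailed(key)
        etag = hashlib.sha256(body).hexdigest()
        self._objects[key] = (etag, body)
        return etag

    def get(self, key):
        return self._objects[key]


if __name__ == "__main__":
    bucket = ToyBucket()
    etag_v1 = bucket.put("ledger", b"v1")

    # Writer A read v1 and swaps in v2: precondition holds.
    bucket.put("ledger", b"v2", if_match=etag_v1)

    # Writer B also read v1, but it is now stale: the write is refused.
    try:
        bucket.put("ledger", b"v3", if_match=etag_v1)
    except PreconditionFailed:
        print("stale writer rejected")
```

A store that ignores or rejects the header cannot give the second writer that refusal, which is why the backend's conditional-PUT behavior is worth probing before a migration.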

Glad to share the full packet capture or per-op breakdown, or test a patch against R2.

