Writeup on a branch-op latency wall we hit running v0.7.2 in cluster mode, in case it helps prioritize what looks like RFC-013 step 6.
Short version: branch ops (create/delete/merge) are dominated by a fixed number of serialized S3 round-trips that scales with the number of tables (node/edge types), not data volume or commit history. There's no concurrency on the manifest/commit-graph paths, and each Lance metadata open builds a fresh (unpooled) client, so every round-trip also pays a cold TLS handshake. On an object store with real RTT this adds up fast.
Environment: v0.7.2, cluster mode, storage on Cloudflare R2, server same-region (~17ms warm RTT). Graph: 112 tables (20 node + 92 edge types), 4 commits, ~58 rows, so effectively empty.
Measured (instrumented on the live server plus node-side packet capture):
| op |
wall |
S3 requests |
TCP conns |
req/table |
CPU |
fork (POST /branches) |
~5.5s |
448 |
115 |
4.0 |
0.35s (~94% idle) |
delete (DELETE /branches/{b}) |
~13.6s |
1281 |
288 |
11.4 |
0.7s (~94% idle) |
About 94% of wall is I/O wait. The same ops, same 112 tables, against local disk (file://) run in 45ms / 55ms; against in-cluster MinIO, 0.25s / 0.46s. So wall-clock is roughly round-trip-count times per-request-latency, and the round-trip count (hundreds, on an empty graph) is the amplifier.
Code-level (read at tag v0.7.2):
- No concurrency on the control plane. The manifest/commit-graph/branch paths are
for … .await chains; the only buffer_unordered/join_all in non-test src/ are in data-fragment staging and optimize (e.g. merge's per-table loops at exec/merge.rs:1549 and :1747).
- No shared
Session or pooled client for manifest and commit-graph opens (instrumentation.rs:226 open_dataset_tracked, vs the read-path open_table_dataset at :252 which does get the shared Session). That means a cold TLS handshake per open, matching the 115/288 fresh connections we saw.
- Amplifiers: create re-validates the schema contract and opens the commit graph from scratch twice; delete repeats
Dataset::open plus refresh(); publish scans __manifest 3 to 4 times (db/manifest/publisher.rs:106-117); merge re-opens coordinators about 6 times in the prelude (exec/merge.rs:1409-1456).
What we did as a stopgap: we migrated the graph's storage from R2 to an in-cluster MinIO (same node as the server, sub-ms RTT), keeping R2 as an off-site export backup. The idea was to attack the per-request latency rather than the round-trip count, since we can't easily change the latter from outside. It works because the same hundreds of round-trips, when each is sub-ms instead of ~17ms, finish quickly. One thing worth flagging for others on object stores: the cluster ledger's CAS needs If-Match on PutObject, MinIO honors it (we verified with a raw conditional-PUT probe and a full init/fork/delete cycle), whereas Hetzner Object Storage (Ceph RGW) rejects it. After cutover, measured on the production server: fork 4.7s to 0.14s, delete 10.3s to 0.26s. The catch is this only hides the root cause: the round-trip count is unchanged, so it will regrow as commit history accumulates, and it forces storage to be co-located.
Glad to share the full packet capture or per-op breakdown, or test a patch against R2.
Writeup on a branch-op latency wall we hit running v0.7.2 in cluster mode, in case it helps prioritize what looks like RFC-013 step 6.
Short version: branch ops (create/delete/merge) are dominated by a fixed number of serialized S3 round-trips that scales with the number of tables (node/edge types), not data volume or commit history. There's no concurrency on the manifest/commit-graph paths, and each Lance metadata open builds a fresh (unpooled) client, so every round-trip also pays a cold TLS handshake. On an object store with real RTT this adds up fast.
Environment: v0.7.2, cluster mode, storage on Cloudflare R2, server same-region (~17ms warm RTT). Graph: 112 tables (20 node + 92 edge types), 4 commits, ~58 rows, so effectively empty.
Measured (instrumented on the live server plus node-side packet capture):
POST /branches)DELETE /branches/{b})About 94% of wall is I/O wait. The same ops, same 112 tables, against local disk (
file://) run in 45ms / 55ms; against in-cluster MinIO, 0.25s / 0.46s. So wall-clock is roughly round-trip-count times per-request-latency, and the round-trip count (hundreds, on an empty graph) is the amplifier.Code-level (read at tag v0.7.2):
for … .awaitchains; the onlybuffer_unordered/join_allin non-testsrc/are in data-fragment staging andoptimize(e.g. merge's per-table loops atexec/merge.rs:1549and:1747).Sessionor pooled client for manifest and commit-graph opens (instrumentation.rs:226open_dataset_tracked, vs the read-pathopen_table_datasetat:252which does get the shared Session). That means a cold TLS handshake per open, matching the 115/288 fresh connections we saw.Dataset::openplusrefresh(); publish scans__manifest3 to 4 times (db/manifest/publisher.rs:106-117); merge re-opens coordinators about 6 times in the prelude (exec/merge.rs:1409-1456).What we did as a stopgap: we migrated the graph's storage from R2 to an in-cluster MinIO (same node as the server, sub-ms RTT), keeping R2 as an off-site export backup. The idea was to attack the per-request latency rather than the round-trip count, since we can't easily change the latter from outside. It works because the same hundreds of round-trips, when each is sub-ms instead of ~17ms, finish quickly. One thing worth flagging for others on object stores: the cluster ledger's CAS needs
If-Matchon PutObject, MinIO honors it (we verified with a raw conditional-PUT probe and a full init/fork/delete cycle), whereas Hetzner Object Storage (Ceph RGW) rejects it. After cutover, measured on the production server: fork 4.7s to 0.14s, delete 10.3s to 0.26s. The catch is this only hides the root cause: the round-trip count is unchanged, so it will regrow as commit history accumulates, and it forces storage to be co-located.Glad to share the full packet capture or per-op breakdown, or test a patch against R2.