Skip to content

Branch ops are O(num-tables) serialized S3 round-trips: fork ~5s / delete ~14s on a near-empty graph (RFC-013 step 6?) #310

Description

@joanreyero

Writeup on a branch-op latency wall we hit running v0.7.2 in cluster mode, in case it helps prioritize what looks like RFC-013 step 6.

Short version: branch ops (create/delete/merge) are dominated by a fixed number of serialized S3 round-trips that scales with the number of tables (node/edge types), not data volume or commit history. There's no concurrency on the manifest/commit-graph paths, and each Lance metadata open builds a fresh (unpooled) client, so every round-trip also pays a cold TLS handshake. On an object store with real RTT this adds up fast.

Environment: v0.7.2, cluster mode, storage on Cloudflare R2, server same-region (~17ms warm RTT). Graph: 112 tables (20 node + 92 edge types), 4 commits, ~58 rows, so effectively empty.

Measured (instrumented on the live server plus node-side packet capture):

op wall S3 requests TCP conns req/table CPU
fork (POST /branches) ~5.5s 448 115 4.0 0.35s (~94% idle)
delete (DELETE /branches/{b}) ~13.6s 1281 288 11.4 0.7s (~94% idle)

About 94% of wall is I/O wait. The same ops, same 112 tables, against local disk (file://) run in 45ms / 55ms; against in-cluster MinIO, 0.25s / 0.46s. So wall-clock is roughly round-trip-count times per-request-latency, and the round-trip count (hundreds, on an empty graph) is the amplifier.

Code-level (read at tag v0.7.2):

  1. No concurrency on the control plane. The manifest/commit-graph/branch paths are for … .await chains; the only buffer_unordered/join_all in non-test src/ are in data-fragment staging and optimize (e.g. merge's per-table loops at exec/merge.rs:1549 and :1747).
  2. No shared Session or pooled client for manifest and commit-graph opens (instrumentation.rs:226 open_dataset_tracked, vs the read-path open_table_dataset at :252 which does get the shared Session). That means a cold TLS handshake per open, matching the 115/288 fresh connections we saw.
  3. Amplifiers: create re-validates the schema contract and opens the commit graph from scratch twice; delete repeats Dataset::open plus refresh(); publish scans __manifest 3 to 4 times (db/manifest/publisher.rs:106-117); merge re-opens coordinators about 6 times in the prelude (exec/merge.rs:1409-1456).

What we did as a stopgap: we migrated the graph's storage from R2 to an in-cluster MinIO (same node as the server, sub-ms RTT), keeping R2 as an off-site export backup. The idea was to attack the per-request latency rather than the round-trip count, since we can't easily change the latter from outside. It works because the same hundreds of round-trips, when each is sub-ms instead of ~17ms, finish quickly. One thing worth flagging for others on object stores: the cluster ledger's CAS needs If-Match on PutObject, MinIO honors it (we verified with a raw conditional-PUT probe and a full init/fork/delete cycle), whereas Hetzner Object Storage (Ceph RGW) rejects it. After cutover, measured on the production server: fork 4.7s to 0.14s, delete 10.3s to 0.26s. The catch is this only hides the root cause: the round-trip count is unchanged, so it will regrow as commit history accumulates, and it forces storage to be co-located.

Glad to share the full packet capture or per-op breakdown, or test a patch against R2.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions