Drop write-lane wrap from BatchWriteItem entry points #568
Merged
Osvaldo Andrade (osvaldoandrade) merged 1 commit on Jun 23, 2026
720da17 routed PutItemWithCtx, DeleteItemWithCtx and BatchWriteItemCtx through the write lane to thread service levels through the DRR scheduler. Under sustained write load this deadlocks: the entry point holds one of N write-lane workers across CommitBatch -> Replicate, which blocks waiting for raft to commit + apply. The leader FSM then calls ApplyCommittedBatch, which itself takes the write lane (TestApplyCommittedBatchUsesWriteLane pins this). With every worker parked at <-req.done in Replicate, the FSM cannot acquire a slot and the round-trip stalls until RPC_TIMEOUT.

Restore the #428 invariant: high-level write entries bypass the lane. The lane still throttles FSM applies and Set / Delete, which is where it was designed to live. Ctx variants keep their signatures for cancellation and observability, but no longer reserve a worker. SL threading on writes will need to ride on the batch payload and be dispatched at ApplyCommittedBatch, not at the caller goroutine.

Also restore dispatch's workers*2 buffer that 720da17 dropped to 0: without it the dispatcher serializes worker handoffs and loses pipelining; the change was not motivated by the commit.

8-node bench (24 shards, RF=3, 64 write workers, 5m sustained):

    baseline cab738b: write_only 110 911 rows/s,  0 errors, 5m
    720da17 broken:   write_only     589 rows/s, 64 errors, 30s abort
    master + fix:     write_only 113 975 rows/s,  0 errors, 5m
Summary
- `720da17` wrapped `PutItemWithCtx`, `DeleteItemWithCtx` and `BatchWriteItemCtx` in `runWriteCtx`, holding a write-lane worker across `CommitBatch -> Replicate`. The leader FSM's `ApplyCommittedBatch` also takes the write lane (the #428 invariant, "[Wave 2] Bypass write-lane in CommitBatch to restore group-commit", pinned by `TestApplyCommittedBatchUsesWriteLane`), so under sustained load every worker is parked in Replicate and the FSM cannot acquire a slot; every write deadlocks until `RPC_TIMEOUT`.
- Restore the #428 invariant: high-level write entries bypass the lane. Also restore the `workers*2` dispatch buffer that `720da17` dropped to 0. The lane still throttles FSM applies and `Set` / `Delete`, which is where it was designed to live. `Ctx` variants keep their signatures for cancellation and observability.
- SL threading on writes will need to ride on the batch payload and be dispatched at `ApplyCommittedBatch`, not at the caller goroutine. Tracked as follow-up.

Bench (8-node, 24 shards, RF=3, 64 write workers, 5m sustained)
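| Build | write_only (rows/s) | Errors | Duration |
| --- | --- | --- | --- |
| baseline `cab738b` | 110 911 | 0 | 5m |
| `720da17` (broken) | 589 | 64 | 30s abort |
| master + fix | 113 975 | 0 | 5m |

write_only recovery factor (fix vs. broken): ~193x.

Test plan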
- `go test ./internal/storage/adapter/pebble/... -race`
- `go test ./internal/server/...`
- `scripts/bench/bench_8node_matrix.sh` PASS on all 5 phases, 0 errors anywhere
- `TestApplyCommittedBatchUsesWriteLane` still pins the FSM-side lane invariant
- `TestCommitBatchBypassesWriteLane` unchanged
- `TestCtxAwareMethodsUseServiceLevelShares` updated: write-side assertion removed (it was asserting the deadlock-causing behaviour); read-side SL routing still verified