Performance Tuning Knobs

Osvaldo edited this page May 19, 2026 · 1 revision

Tuning Knobs

This page is the catalog of every knob in codeQ that has been observed to move performance, with the file path where each knob lives, the mechanical effect of changing it, the cost of changing it, and a recommended starting value. The knobs split into three families: storage (Pebble), client (workerclient and producerclient), and RAFT replication. Each is covered in a section below. Compile-time constants that govern the coalescers are listed at the end, because operators do not normally change them but readers tracing a bottleneck need to know they exist.

The reference for every storage and RAFT knob is pkg/config/config.go. The references for client knobs are pkg/workerclient/client.go and pkg/producerclient. The line numbers cited below refer to the source at the time of writing; they are stable enough to use as anchors but will drift as the code evolves.

Storage knobs

The Pebble storage knobs sit inside the PersistenceConfig JSON blob that Config.PersistenceConfig carries (pkg/config/config.go:21). They are passed to the Pebble provider at startup and influence the LSM tree, the commit path, and the shard topology.

numShards is the number of independent Pebble directories the provider opens. Mechanically, setting numShards=4 opens four *pebble.DB instances, each with its own commit pipeline, coalescer, and reaper. Writes are routed across them by FNV-1a over the task ID (internal/repository/pebble/sharded_task_repository.go). The effect on throughput is described in detail on the Multi-Shard Scaling page. The cost is memory (each shard allocates a 256 MiB block cache plus its working set) and admin-query latency (cross-shard scans walk all N shards). The recommended starting value on a twelve-core host is four. On a four-core host, two. On a two-core host or a memory-constrained environment, one. Pushing past four on a twelve-core box does not show measured throughput improvement; pushing past eight is the regime where memory and GC pressure become operational concerns.
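
The routing step can be sketched in a few lines of Go. This is a minimal illustration, assuming a 32-bit FNV-1a hash reduced by modulo; the function name shardFor is hypothetical, and the actual router in sharded_task_repository.go may use a different hash width or reduction.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a task ID to a shard index in [0, numShards).
// Assumption: 32-bit FNV-1a reduced by modulo; the real router may differ.
func shardFor(taskID string, numShards int) int {
	if numShards <= 1 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(taskID)) // hash.Hash.Write never returns an error
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	for _, id := range []string{"task-001", "task-002", "task-003"} {
		fmt.Printf("%s -> shard %d\n", id, shardFor(id, 4))
	}
}
```

The property that matters is that the mapping is deterministic: the same task ID always lands on the same shard, so a later lookup by ID touches one shard while admin scans still walk all of them.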

fsyncOnCommit controls whether each batch commit issues an fsync(2) on the Pebble WAL before returning. The flag lives in internal/repository/pebble/db.go:137. Default is false, which corresponds to Pebble's NoSync write option. Setting it to true forces an fsync per coalesced commit. Mechanically this means: with NoSync, a kernel panic between commit-return and the kernel's eventual flush loses roughly the last millisecond of writes. With fsync-on-commit, every write whose commit has returned is already past the fsync point and survives a panic. The throughput cost depends on the storage device. On a decent NVMe the cost is typically ten to thirty percent of baseline. On consumer SSDs or rotational disks the cost can be an order of magnitude. The coalescer amortizes the fsync across the merged batch (one fsync per coalesced commit, not one per submitter), which is why the worst case is not catastrophic. The recommended starting value is false. Flip it to true only if a kernel-panic-loses-a-millisecond loss model is unacceptable for your workload, and measure the cost on your hardware before committing.

The Pebble path is where the LSM tree lives on disk. The performance-relevant property is that it must be on local fast storage — not NFS, not a network file system, not a remote volume. The Pebble commitPipeline issues system calls per batch, and a network file system that round-trips each syscall will dominate latency over everything else the system does. On a Linux host with a local NVMe, the path choice does not affect throughput in any measurable way. On a virtualized environment with a slow virtual disk, the path latency becomes the floor of every commit.
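
As a rough illustration of how the storage knobs travel as a JSON blob, the sketch below decodes one into a Go struct. The struct pebbleKnobs, the function parseKnobs, the "path" key name, and the single-shard default are all assumptions made for the example; only the numShards and fsyncOnCommit names come from this page, and the real PersistenceConfig schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pebbleKnobs mirrors the storage knobs named on this page.
// The struct itself and the "path" key are hypothetical.
type pebbleKnobs struct {
	Path          string `json:"path"`
	NumShards     int    `json:"numShards"`
	FsyncOnCommit bool   `json:"fsyncOnCommit"`
}

// parseKnobs decodes the blob and rejects a shard count below one.
func parseKnobs(blob []byte) (pebbleKnobs, error) {
	k := pebbleKnobs{NumShards: 1} // assumed default: one shard, no fsync
	if err := json.Unmarshal(blob, &k); err != nil {
		return pebbleKnobs{}, err
	}
	if k.NumShards < 1 {
		return pebbleKnobs{}, fmt.Errorf("numShards must be >= 1, got %d", k.NumShards)
	}
	return k, nil
}

func main() {
	blob := []byte(`{"path":"/var/lib/codeq","numShards":4,"fsyncOnCommit":false}`)
	k, err := parseKnobs(blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", k)
}
```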

Client knobs

The worker client at pkg/workerclient/client.go exposes four knobs that control how aggressively each worker drains the queue. They are the most operator-facing performance levers in the system because they sit in the client process, not the server, so changing them does not require a server restart.

Concurrency is the number of in-flight task slots the client maintains (pkg/workerclient/client.go:43). Each slot owns one Ready-Task-Result cycle. Default is one. The bench harness sets it to one hundred twenty-eight (internal/bench/profile_full_cycle_test.go:80). Mechanically, higher concurrency lets the worker hold more outstanding claims against the server's queue, which keeps the queue drained at higher producer rates. The cost is memory (each slot has its own buffers and a goroutine) and the risk that a misbehaving handler can have many tasks in flight simultaneously, increasing the blast radius of a bug. The recommended starting value for production workloads is somewhere between sixteen and one hundred twenty-eight, sized to the worker's processing capacity. A worker whose handler takes ten milliseconds per task can keep one hundred slots busy at ten thousand tasks per second; a worker whose handler takes one second per task can only keep ten slots busy at ten per second.
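
The sizing rule in the last two sentences is Little's law: slots needed equals target throughput times per-task latency. A minimal sketch of that arithmetic follows; slotsFor is a hypothetical helper, not part of the workerclient API.

```go
package main

import (
	"fmt"
	"time"
)

// slotsFor returns how many in-flight slots are needed to sustain
// tasksPerSec when each task occupies a slot for perTask, rounded up.
// Integer nanosecond math avoids floating-point rounding surprises.
func slotsFor(tasksPerSec int64, perTask time.Duration) int64 {
	if tasksPerSec <= 0 || perTask <= 0 {
		return 1
	}
	busyNanos := tasksPerSec * perTask.Nanoseconds()
	second := int64(time.Second)
	return (busyNanos + second - 1) / second
}

func main() {
	fmt.Println(slotsFor(10000, 10*time.Millisecond)) // 100
	fmt.Println(slotsFor(10, time.Second))            // 10
}
```

The result is a lower bound on Concurrency for a single worker; it ignores network round-trips between Result and the next Ready, so some headroom above the computed value is reasonable.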

BatchSize controls how many tasks each slot tries to claim per Ready frame and how many results coalesce into one ResultBatch (pkg/workerclient/client.go:58). Default is zero, which the client interprets as one — the single-task path. Setting it to a value greater than one enables the batched path: each Ready requests up to BatchSize tasks, the server replies with a TaskBatch of up to that many, and the resulting Results are sent back as one ResultBatch. Mechanically this amortizes the gRPC frame overhead and the Pebble commit cost across the batch. The bench has measured this path; in single-node configurations it offers a meaningful throughput improvement over single-task at the cost of higher per-cycle latency (a result is sent only when the batch is complete). The cost is also that a worker holding a batch of N tasks under one lease has a larger uncommitted-work envelope; if the worker crashes, all N are re-delivered. A recommended starting value is eight to thirty-two for workloads where per-task latency is not critical, and zero for workloads where latency matters.
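
To make the amortization concrete, the sketch below counts how many Ready frames it takes to claim a given number of tasks at different BatchSize values. Both helpers are hypothetical, and the count assumes every TaskBatch comes back full, which only holds when the queue is saturated.

```go
package main

import "fmt"

// effectiveBatch applies the rule from this page: zero means the
// single-task path. Treating negative values the same way is an assumption.
func effectiveBatch(batchSize int) int {
	if batchSize < 1 {
		return 1
	}
	return batchSize
}

// readyFrames is the minimum number of Ready frames needed to claim n
// tasks, assuming each reply carries a full batch.
func readyFrames(n, batchSize int) int {
	b := effectiveBatch(batchSize)
	return (n + b - 1) / b
}

func main() {
	for _, bs := range []int{0, 8, 32} {
		fmt.Printf("BatchSize=%d: %d Ready frames for 1000 tasks\n", bs, readyFrames(1000, bs))
	}
}
```

The same division applies to the number of tasks re-delivered after a worker crash: a slot holding a full batch loses up to effectiveBatch tasks of progress at once.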

LeaseSeconds is the lease duration the client requests on each Ready (pkg/workerclient/client.go:48). Zero means "use the server default" (sixty seconds in the standard config, see Config.DefaultLeaseSeconds at pkg/config/config.go:38). Mechanically, the lease is how long the server waits before re-delivering an unacknowledged task. Short leases mean fast recovery from worker crashes; long leases mean a worker with a slow task does not get its task stolen mid-execution. The throughput effect is small — leases do not appear in the commit-path mutex profile — but the operational effect is real. A worker whose handler takes thirty seconds per task should set LeaseSeconds=60 minimum; a worker whose handler takes ten milliseconds is fine with a five- or ten-second lease and benefits from the faster recovery semantics.

IdleBackoff is how long a slot waits before re-sending a Ready when the previous Ready did not yield a task (pkg/workerclient/client.go:63). Default is fifty milliseconds. The throughput effect at saturation is zero (slots are never idle). The effect under partial load is on the order of the backoff interval: a slot that completes a task and finds the queue empty waits fifty milliseconds before checking again. Setting it lower reduces the polling latency at the cost of more wasted Ready frames; setting it higher reduces the frame overhead at the cost of slower drain when work finally arrives. Fifty milliseconds is a reasonable default. Workloads that prioritize latency over throughput can drop it to ten or twenty milliseconds without measurable cost.
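
The frame overhead of a shorter backoff can be estimated with simple arithmetic. The helper below is hypothetical and gives an upper bound: it assumes every idle slot re-sends a Ready exactly once per backoff interval and ignores the round-trip time of the Ready itself.

```go
package main

import (
	"fmt"
	"time"
)

// idleReadyPerSec estimates the Ready frames per second produced by
// idleSlots slots that all find the queue empty and poll once per backoff.
func idleReadyPerSec(idleSlots int, backoff time.Duration) int64 {
	if idleSlots <= 0 || backoff <= 0 {
		return 0
	}
	return int64(idleSlots) * int64(time.Second) / int64(backoff)
}

func main() {
	fmt.Println(idleReadyPerSec(128, 50*time.Millisecond)) // 2560
	fmt.Println(idleReadyPerSec(128, 10*time.Millisecond)) // 12800
}
```

With the bench's Concurrency of one hundred twenty-eight, dropping the backoff from fifty to ten milliseconds raises the idle polling rate fivefold, which is worth checking against the server's frame budget when many workers share one queue.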

The producer client (pkg/producerclient) has fewer knobs in practice, because the producer path is more typically driven by application code than by client configuration. The relevant practice is to use ProduceBatch over Produce when the producer has multiple tasks to submit at once — the batched path amortizes the same gRPC framing and Pebble commit costs that the worker batched path does, on the create side instead of the complete side. The bench harness exercises this path via PHASE6_PROD_BATCH (see internal/bench/profile_full_cycle_test.go:186); a starting value of eight is reasonable for workloads that naturally batch.
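
A producer that accumulates tasks can split them into fixed-size groups before submitting each group in one batched call. The sketch below shows only the splitting step; it deliberately does not call ProduceBatch, whose signature is not shown on this page, and the chunk helper and string payloads are assumptions for illustration.

```go
package main

import "fmt"

// chunk splits items into consecutive groups of at most size elements.
// The three-index slice caps each group's capacity so a later append on
// one group cannot overwrite the next.
func chunk(items []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for len(items) > size {
		out = append(out, items[:size:size])
		items = items[size:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	tasks := make([]string, 20)
	for i := range tasks {
		tasks[i] = fmt.Sprintf("payload-%02d", i)
	}
	for i, group := range chunk(tasks, 8) {
		// Each group would be handed to one batched produce call.
		fmt.Printf("batch %d: %d tasks\n", i, len(group))
	}
}
```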

RAFT knobs

The RAFT knobs sit in RaftConfig (pkg/config/config.go:128-152). They are most useful as performance levers when the deployment is under stress; on healthy clusters the defaults are usually fine.

HeartbeatMS is how often the leader sends heartbeats to followers. Default in the bench config is fifty milliseconds (raft_grpc_bench_test.go:227). Production defaults vary by deployment; one hundred to two hundred fifty is common. Mechanically, more frequent heartbeats detect leader failures faster but consume more network and CPU. The throughput effect is small but not zero — heartbeat frames share the same TCP connection as AppendEntries, so very aggressive heartbeats compete with the apply traffic. The recommended starting value is one hundred milliseconds for production deployments on a real network, fifty for tests on loopback.

ElectionMS is how long a follower waits without a heartbeat before starting a new election. It should be a multiple of HeartbeatMS, typically ten times. The bench sets it to fifty milliseconds, the same as its heartbeat interval, for fast test runs; production should not copy that setting. A short election timeout makes the cluster reactive to leader failures but increases the risk of spurious elections under transient network slowness, which cost throughput because the cluster is temporarily without a leader. The recommended starting value is one second for production. Test environments can use fifty milliseconds.

LeaderLeaseMS is the maximum time a leader stays leader without successful heartbeat round-trips to a majority of the cluster. After the lease expires, the leader steps down voluntarily. It does not appear directly in the throughput profile but it interacts with the election and heartbeat intervals. The bench uses fifty milliseconds. Production typically uses one or two seconds.

CommitMS is the commit batching interval at the raft layer (pkg/config/config.go:151). It controls how aggressively the raft loop coalesces log entries before committing. The bench uses ten milliseconds, which is short. Lower values mean lower latency per commit but smaller commit batches; higher values batch more entries per commit at the cost of latency. The Apply coalescer at internal/raft/db.go:594 does its own batching on top of this, so the CommitMS interaction is subtle. The recommended starting value is ten milliseconds for low-latency workloads, twenty-five to fifty for throughput-oriented workloads where a few extra milliseconds per commit is acceptable.

ApplyTimeoutSeconds is the per-Apply timeout. If a raft Apply does not complete within this window, the call returns an error. The bench uses three seconds. Production should use a value comfortably above the worst-case commit latency on the slowest follower. Five seconds is a safe default for healthy clusters; ten seconds is appropriate for clusters with slower disks.
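
The relationship between the timers above can be checked mechanically before a config is rolled out. The sketch below is a hypothetical lint, not part of codeQ: the raftTimers struct only borrows the field names from this page, and the single rule it enforces is the "election timeout is typically ten heartbeats" guideline.

```go
package main

import "fmt"

// raftTimers borrows the knob names from this page; the struct layout
// is an assumption and does not mirror RaftConfig exactly.
type raftTimers struct {
	HeartbeatMS, ElectionMS, LeaderLeaseMS, CommitMS, ApplyTimeoutSeconds int
}

// checkTimers returns human-readable warnings for settings that deviate
// from the guidance on this page. An empty result means no warnings.
func checkTimers(t raftTimers) []string {
	var warns []string
	if t.HeartbeatMS <= 0 || t.ElectionMS <= 0 || t.LeaderLeaseMS <= 0 ||
		t.CommitMS <= 0 || t.ApplyTimeoutSeconds <= 0 {
		return append(warns, "all timers must be positive")
	}
	if t.ElectionMS < 10*t.HeartbeatMS {
		warns = append(warns, fmt.Sprintf(
			"ElectionMS=%d is under ten heartbeats (%d ms); spurious elections are more likely",
			t.ElectionMS, 10*t.HeartbeatMS))
	}
	return warns
}

func main() {
	production := raftTimers{HeartbeatMS: 100, ElectionMS: 1000, LeaderLeaseMS: 1000, CommitMS: 10, ApplyTimeoutSeconds: 5}
	bench := raftTimers{HeartbeatMS: 50, ElectionMS: 50, LeaderLeaseMS: 50, CommitMS: 10, ApplyTimeoutSeconds: 3}
	fmt.Println("production warnings:", checkTimers(production))
	fmt.Println("bench warnings:", checkTimers(bench))
}
```

The production starting points pass cleanly, while the bench values trip the election-timeout warning, which is expected: the bench trades election stability for fast test runs on loopback.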

MuxEnabled collapses every Pebble shard's raft group onto a single TCP listener at BindAddr, demuxed by group ID. The default with this off is one TCP port per shard, which works but expands the port footprint. With it on, the cluster uses one port per node. The throughput effect is small in either direction; the operational effect is to simplify firewall and network topology. The recommended value for new deployments is true. It is not a safe rolling upgrade for existing clusters because the topology changes.

Compile-time constants

Three constants in the codebase set the coalescer behavior. They are not exposed in Config because changing them requires understanding the trade-offs precisely.

maxMergeBatch = 64 at internal/repository/pebble/db.go:122 is the cap on how many concurrent Pebble batches are merged into a single commit. Higher values amortize the commitPipeline mutex over more operations but increase tail latency for late joiners (a writer that arrives just after the coalescer dispatched waits one merge cycle) and increase the merged batch's memory footprint. Sixty-four was picked empirically; the bench has not been shown to benefit from larger values.

commitChanBuf = 1024 at internal/repository/pebble/db.go:128 is the queue depth between commit submitters and the coalescer goroutine. It is sized to absorb several commit cycles of in-flight batches at peak load. The bench has not been observed to fill this channel; it is sized conservatively to avoid head-of-line blocking under bursts.

raftMergeBatch = 128 at internal/raft/db.go:149 is the analogous cap for the RAFT apply coalescer. One hundred twenty-eight gives the apply loop room to merge larger windows than the Pebble coalescer, because the raft round-trip is more expensive than a Pebble commit and benefits from larger amortization. The bench has measured thirty to fifty percent improvement at this size versus a baseline without merging.
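
For readers tracing how a merge cap and a channel buffer interact, the sketch below shows the general coalescing pattern: block for the first pending submission, then drain whatever else is already queued, up to the cap, and commit the lot once. It is a generic illustration under stated assumptions, not the codeQ implementation; the collect function and the int payloads are hypothetical, and only the two constant values come from this page.

```go
package main

import "fmt"

const (
	maxMergeBatch = 64   // cap on submissions merged into one commit (value from this page)
	commitChanBuf = 1024 // queue depth between submitters and coalescer (value from this page)
)

// collect blocks for the first pending item, then drains without blocking
// until the channel is empty or max items are gathered. It returns nil
// once the channel is closed and empty.
func collect(ch <-chan int, max int) []int {
	first, ok := <-ch
	if !ok {
		return nil
	}
	batch := []int{first}
	for len(batch) < max {
		select {
		case v, ok := <-ch:
			if !ok {
				return batch
			}
			batch = append(batch, v)
		default:
			return batch // nothing else queued; commit what we have
		}
	}
	return batch
}

func main() {
	ch := make(chan int, commitChanBuf)
	for i := 0; i < 100; i++ {
		ch <- i
	}
	close(ch)
	for batch := collect(ch, maxMergeBatch); batch != nil; batch = collect(ch, maxMergeBatch) {
		fmt.Printf("one commit covers %d submissions\n", len(batch))
	}
}
```

With one hundred submissions queued, the loop commits sixty-four and then thirty-six. The second group is the late-joiner case described above: those submitters wait one extra merge cycle, which is the tail-latency cost of a larger cap.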

Changing any of these constants requires recompiling codeQ. The recommendation is not to change them without a profile in hand showing the specific mutex contention they would address.

A short tuning order

For an operator tuning a fresh deployment, the order is: pick numShards first based on the host's core count (one shard per two to three cores, capped at eight). Pick fsyncOnCommit=false unless the durability model demands otherwise, and measure if you flip it on. Pick worker Concurrency to match the worker handler's per-task latency (slots equals desired throughput times per-task latency). Pick BatchSize based on whether per-task latency matters: zero for low-latency workloads, eight to thirty-two for throughput-oriented workloads. Pick RAFT timers from the production-default starting points (one hundred millisecond heartbeat, one second election, ten millisecond commit, five second apply timeout) and adjust only if the cluster shows specific symptoms.
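
The first step of that order can be written down as a formula. The sketch below rounds the core count divided by three up to the next integer and caps the result at eight. That formula is an assumption chosen because it reproduces the three recommendations given earlier (twelve cores to four shards, four cores to two, two cores to one); shardsForCores is a hypothetical helper, and memory-constrained hosts should still prefer one shard regardless of core count.

```go
package main

import (
	"fmt"
	"runtime"
)

// shardsForCores suggests a starting numShards: ceil(cores / 3),
// clamped to the range [1, 8].
func shardsForCores(cores int) int {
	n := (cores + 2) / 3
	if n < 1 {
		n = 1
	}
	if n > 8 {
		n = 8
	}
	return n
}

func main() {
	for _, c := range []int{2, 4, 12} {
		fmt.Printf("%d cores -> numShards=%d\n", c, shardsForCores(c))
	}
	fmt.Printf("this host (%d cores) -> numShards=%d\n", runtime.NumCPU(), shardsForCores(runtime.NumCPU()))
}
```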

The compile-time constants stay at their defaults until somebody runs a bench that says otherwise.

Where to go next

For the throughput numbers these knobs move, see Performance Single Node Throughput and Performance Multi Shard Scaling. For how to run the benches that produced the published numbers, see Performance Bench Harness. For the RAFT replication layer that the RAFT knobs govern, see Architecture Raft.
