From 6d0396fbd9b445d35acb2b724457a9a8c0a5b952 Mon Sep 17 00:00:00 2001
From: Zeke
Date: Mon, 15 Jun 2026 23:56:49 -0700
Subject: [PATCH] docs(bench): efficiency findings + prioritized optimization scope (A6)

The "scope the gap" step of the performance track. With the harness
complete (A1-A5), this records where IronCache stands against the bar and
scopes the optimization precisely, rather than rewriting the store waist
speculatively.

Indicative head-to-head (IronCache vs a redis-server 7.2.1 stand-in,
unpinned macOS, 300k keys via scripts/bench/headtohead.sh):
- bytes-per-key 527 vs 245 = ~2.1x HEAVIER. This is the reliable finding
  (a deterministic used_memory delta, not contention-sensitive) and the
  real gap.
- qps-per-core ~parity (7151 vs 7528) but contention-bound on an unpinned,
  co-resident box, so NOT authoritative; IronCache's far lower p50/p99
  (1006us vs 4187us; 2.5ms vs 65ms) suggests headroom a pinned run would
  expose.

docs/bench/FINDINGS.md decomposes the memory gap with the A1 model (a fat
per-slot KvObj value sized for its largest inline variant, ~160 B/slot,
plus ~210 B/key Swiss-table slack at load) and scopes the levers, each as
its own future effort now protected by the A5 perf-gate: L1 box the large
KvObj variants to shrink the slot (highest impact), L2 a Dashtable-style
compact index, L3 load-factor tuning; and for throughput, confirm on
pinned Linux vs valkey 9.1.0 before optimizing (io_uring #28 is the lever
if a real gap appears). A speculative store rewrite is deliberately
deferred to the authoritative bar. README points to the findings.

Co-Authored-By: Claude Opus 4.8 (1M context)
Signed-off-by: Zeke
---
 docs/bench/FINDINGS.md  | 97 +++++++++++++++++++++++++++++++++++++++++
 scripts/bench/README.md |  6 +++
 2 files changed, 103 insertions(+)
 create mode 100644 docs/bench/FINDINGS.md

diff --git a/docs/bench/FINDINGS.md b/docs/bench/FINDINGS.md
new file mode 100644
index 0000000..2b6d1e0
--- /dev/null
+++ b/docs/bench/FINDINGS.md
@@ -0,0 +1,97 @@
+
+
+# IronCache efficiency findings and optimization scope (A6, dated 2026-06-16)
+
+This is the "scope the gap" step of the performance track (task A6): now that the
+measurement harness exists (A1 memory model, A2 load generator, A3 reproducible
+run, A4 head-to-head, A5 per-PR regression gate), this records where IronCache
+actually stands against the bar and scopes the optimization work precisely,
+rather than optimizing speculatively. The track's principle holds: you do not
+optimize what you have not measured, and you do not rewrite the store waist
+blind.
+
+## How these numbers were taken (and their caveats)
+
+An INDICATIVE head-to-head via `scripts/bench/headtohead.sh`:
+
+- IronCache `0.0.0` vs **redis-server 7.2.1** as a wire-compatible STAND-IN. The
+  published bar is the pinned **valkey-server 9.1.0** (docs/bench/COMPETITORS.md);
+  Valkey 8.0+ embeds keys/values, so its memory will differ from this 7.2 redis.
+- **Unpinned, on a 10-core macOS dev box** (no taskset), with the load generator
+  co-resident on the same cores as the server. So the THROUGHPUT comparison is
+  contention-bound and NOT authoritative.
+- 300,000 keys, 128-byte values, zipf 0.99, 90% reads, 50 connections, 5s.
+
+The authoritative verdict requires running the same harness on a pinned Linux box
+(disjoint server/client cores) against valkey-server 9.1.0. That run is a
+CI/dedicated-runner activity; the harness is ready for it.
+
+## Indicative results
+
+| Metric | IronCache | redis 7.2.1 | IronCache / competitor |
+| --- | ---: | ---: | ---: |
+| bytes-per-key (used_memory delta / N) | 527 | 245 | **2.15x (worse)** |
+| qps-per-core (closed-loop, unpinned) | 7151 | 7528 | 0.95x |
+| open-loop p50 | 1006 us | 4187 us | 0.24x (better) |
+| open-loop p99 | 2513 us | 65663 us | 0.04x (better) |
+
+### What is trustworthy vs not
+
+- **bytes-per-key (~2.1x heavier) is RELIABLE.** It is a deterministic
+  `used_memory` delta over an identical deterministic populate; it is not
+  sensitive to pinning, contention, or the co-resident load generator. This is a
+  real gap and the same direction the A1 memory model predicted.
+- **qps-per-core (~parity) is NOT authoritative.** On an unpinned box with the
+  load generator stealing cores from the server, this is contention/loopback
+  bound, not a clean server-throughput measurement. IronCache's far lower p50/p99
+  latency suggests headroom that a pinned run would expose; the pinned Linux run
+  is needed before drawing a throughput conclusion.
+
+## The memory gap, decomposed (A1 memory model)
+
+The A1 `memmodel` decomposition (object vs table slack) locates the ~2.1x:
+
+1. **Fat per-slot value.** The stored-value type (`KvObj`) is sized for its
+   largest inline variant, so every hash-table slot reserves that footprint even
+   for an int or a short string. Measured per-slot footprint is ~160 bytes. Redis
+   and Valkey keep a pointer-sized slot and put the object behind it (Valkey 8.0+
+   even embeds small ones in a single allocation), so their per-entry overhead is
+   much smaller.
+2. **Hash-table slack.** The Swiss-table (hashbrown) bucket array runs at up to
+   7/8 load, so at the operating load factor the slot array contributes ~210
+   bytes/key of amortized slack on top of the slot's own size. A fatter slot
+   makes this slack proportionally worse.
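The two effects in the decomposition above can be illustrated with a minimal Rust sketch. Everything in it is an assumption made for illustration: `FatValue`, `SlimValue`, their variants, and `slot_array_bytes_per_key` are hypothetical names, not IronCache's real `KvObj` or its `memmodel`, and the per-key model is deliberately simplified (it ignores control bytes and allocator overhead).

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a value enum whose largest variant is stored
/// inline. NOT IronCache's real `KvObj`; the variants are invented.
#[allow(dead_code)]
enum FatValue {
    Int(i64),
    InlineBytes([u8; 120]), // large inline payload sets the size of every slot
    Heap(Vec<u8>),
}

/// The same variants with the large payloads moved behind a `Box`, so the
/// enum itself only has to hold an integer or a pointer plus a tag.
#[allow(dead_code)]
enum SlimValue {
    Int(i64),
    InlineBytes(Box<[u8; 120]>),
    Heap(Box<Vec<u8>>),
}

/// Simplified model (an assumption): slot-array bytes charged to each live
/// key when the table is `load_factor` full.
fn slot_array_bytes_per_key(slot_size: usize, load_factor: f64) -> f64 {
    slot_size as f64 / load_factor
}

fn main() {
    let fat = size_of::<FatValue>();
    let slim = size_of::<SlimValue>();
    println!("fat slot:  {fat} bytes");
    println!("slim slot: {slim} bytes");

    // The emptier the table, the more slot bytes each live key pays for,
    // and the penalty scales with the slot size.
    for load in [0.875, 0.5] {
        println!(
            "load {:.3}: fat {:.1} B/key, slim {:.1} B/key",
            load,
            slot_array_bytes_per_key(fat, load),
            slot_array_bytes_per_key(slim, load)
        );
    }
}
```

Boxing trades slot bytes for an extra heap allocation and a pointer dereference on reads, so a change of this kind would need to be checked against a throughput benchmark as well as a memory one.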
+
+Together these dominate the overhead: of ~527 bytes/key for a 128-byte value,
+roughly 400 bytes is metadata + slack, versus roughly 120 bytes for redis.
+
+## Optimization scope (prioritized; each its own effort)
+
+These are scoped, NOT executed here: each touches the frozen Store waist or the
+index and so needs its own design, PR, and review, targeted against the real
+pinned-Linux-vs-valkey numbers. All are now protected by the A5 perf-gate, which
+will catch any throughput regression an optimization introduces.
+
+- **L1 (highest impact): shrink the per-slot footprint.** Box the large `KvObj`
+  variants so the table slot holds a small (near pointer-sized) value, slashing
+  both the slot size AND the amortized table slack per key. Expected to bring
+  bytes-per-key toward the Redis/Valkey range. Risk: a pointer indirection on the
+  read hot path; must be benchmarked against the throughput gate before it lands.
+  This is the single biggest lever and the recommended first optimization PR.
+- **L2: a more compact index.** The DragonflyDB-style Dashtable the README cites
+  (extendible hashing, far less per-entry metadata than a Swiss table at high
+  load) would cut the table slack structurally. Larger; later; its own design.
+- **L3: load-factor / sizing tuning.** Cheaper than L1/L2 but bounded upside;
+  only worth it after L1 since L1 changes the slot size the slack is computed on.
+- **Throughput: confirm before optimizing.** The indicative parity is likely an
+  unpinned-co-resident artifact. Run the pinned Linux head-to-head first; if a
+  real per-core gap appears, the io_uring data path (issue #28, currently
+  tokio/epoll) is the lever. Do not optimize throughput speculatively while the
+  measurement says parity.
+
+## Next step
+
+Run `scripts/bench/headtohead.sh` on a pinned Linux runner against valkey-server
+9.1.0 for the authoritative verdict, then execute L1 as the first optimization PR
+under the A5 perf-gate.
The measurement infrastructure (A1 to A5) is complete and
+makes that work measurable and regression-safe.
diff --git a/scripts/bench/README.md b/scripts/bench/README.md
index c85e2f1..d6653c1 100644
--- a/scripts/bench/README.md
+++ b/scripts/bench/README.md
@@ -2,6 +2,12 @@
 
 # IronCache benchmark run script
 
+> Where IronCache stands today, and the prioritized optimization scope, are in
+> [docs/bench/FINDINGS.md](../../docs/bench/FINDINGS.md) (A6). Headline: memory is
+> the real gap (~2.1x heavier per key than redis in an indicative run, driven by a
+> fat per-slot value + hash-table slack); throughput needs a pinned-Linux run vs
+> valkey to judge.
+
 `scripts/bench/run.sh` is the one scripted invocation that reproduces a published
 benchmark run end to end (BENCHMARK.md #8, PR-A3 of the performance track). It
 builds the release binaries, boots a real IronCache server, warms the hot keyset, runs three