97 changes: 97 additions & 0 deletions docs/bench/FINDINGS.md
@@ -0,0 +1,97 @@
<!-- SPDX-License-Identifier: MIT OR Apache-2.0 -->

# IronCache efficiency findings and optimization scope (A6, dated 2026-06-16)

This is the "scope the gap" step of the performance track (task A6). Now that the
measurement harness exists (A1 memory model, A2 load generator, A3 reproducible
run, A4 head-to-head, A5 per-PR regression gate), this document records where
IronCache actually stands against the bar and scopes the optimization work
precisely, rather than optimizing speculatively. The track's principle holds: you
do not optimize what you have not measured, and you do not rewrite the store
waist blind.

## How these numbers were taken (and their caveats)

An INDICATIVE head-to-head via `scripts/bench/headtohead.sh`:

- IronCache `0.0.0` vs **redis-server 7.2.1** as a wire-compatible STAND-IN. The
published bar is the pinned **valkey-server 9.1.0** (docs/bench/COMPETITORS.md);
Valkey 8.0+ embeds keys/values, so its memory footprint will differ from that of
this redis 7.2.1 stand-in.
- **Unpinned, on a 10-core macOS dev box** (no taskset), with the load generator
co-resident on the same cores as the server. So the THROUGHPUT comparison is
contention-bound and NOT authoritative.
- 300,000 keys, 128-byte values, zipf 0.99, 90% reads, 50 connections, 5s.

The authoritative verdict requires running the same harness on a pinned Linux box
(disjoint server/client cores) against valkey-server 9.1.0. That run is a
CI/dedicated-runner activity; the harness is ready for it.

## Indicative results

| Metric | IronCache | redis 7.2.1 | IronCache / competitor |
| --- | ---: | ---: | ---: |
| bytes-per-key (used_memory delta / N) | 527 | 245 | **2.15x (worse)** |
| qps-per-core (closed-loop, unpinned) | 7151 | 7528 | 0.95x |
| open-loop p50 | 1006 us | 4187 us | 0.24x (better) |
| open-loop p99 | 2513 us | 65663 us | 0.04x (better) |
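
The ratio column can be re-derived from the two measurement columns. The sketch below is only a cross-check of the table's arithmetic, not part of the harness: `bytes_per_key` and `ratio` are hypothetical helper names, and the raw `used_memory` counters in `main` are made-up values chosen so that the delta reproduces the 527 bytes/key in the table.

```rust
/// Bytes per key as the table defines it: used_memory delta divided by N.
fn bytes_per_key(used_before: u64, used_after: u64, keys: u64) -> f64 {
    (used_after - used_before) as f64 / keys as f64
}

/// IronCache measurement divided by the competitor's measurement.
fn ratio(ironcache: f64, competitor: f64) -> f64 {
    ironcache / competitor
}

fn main() {
    // Hypothetical raw counters (not from the run); the delta over 300,000
    // keys is chosen to match the 527 bytes/key reported in the table.
    let bpk = bytes_per_key(1_000_000, 159_100_000, 300_000);
    println!("bytes-per-key from counters: {}", bpk);

    // Measurement columns copied from the table.
    let rows = [
        ("bytes-per-key", 527.0_f64, 245.0_f64),
        ("qps-per-core", 7151.0, 7528.0),
        ("open-loop p50 (us)", 1006.0, 4187.0),
        ("open-loop p99 (us)", 2513.0, 65663.0),
    ];
    for &(name, ironcache, competitor) in rows.iter() {
        println!("{}: {:.2}x", name, ratio(ironcache, competitor));
    }
}
```

For the memory and latency rows a ratio below 1 favours IronCache; for qps-per-core a ratio below 1 means lower throughput.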

### What is trustworthy vs not

- **bytes-per-key (~2.1x heavier) is RELIABLE.** It is a `used_memory` delta over
an identical deterministic populate; it is not sensitive to pinning, contention,
or the co-resident load generator. This is a real gap, and it points in the same
direction the A1 memory model predicted.
- **qps-per-core (~parity) is NOT authoritative.** On an unpinned box with the
load generator stealing cores from the server, this is contention/loopback
bound, not a clean server-throughput measurement. IronCache's far lower p50/p99
latency suggests headroom that a pinned run would expose; the pinned Linux run
is needed before drawing a throughput conclusion.

## The memory gap, decomposed (A1 memory model)

The A1 `memmodel` decomposition (object vs table slack) locates the ~2.1x:

1. **Fat per-slot value.** The stored-value type (`KvObj`) is sized for its
largest inline variant, so every hash-table slot reserves that footprint even
for an int or a short string. Measured per-slot footprint is ~160 bytes. Redis
and Valkey keep a pointer-sized slot and put the object behind it (Valkey 8.0+
even embeds small ones in a single allocation), so their per-entry overhead is
much smaller.
2. **Hash-table slack.** The Swiss-table (hashbrown) bucket array runs at no more
than 7/8 load, so some slots are always empty. At the operating load factor
those empty slots contribute ~210 bytes/key of amortized slack on top of the
slot's own size. A fatter slot makes this slack proportionally worse.
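
To see why slack scales with slot size, here is a simplified model of amortized slack per live key. It is a sketch under stated assumptions, not the A1 `memmodel` code: `slack_per_key` is a hypothetical name, the model ignores hashbrown's control bytes and allocator rounding, and the load factors swept in `main` are illustrative rather than measured. The ~210 bytes/key quoted above is a measurement and is not derived from this function.

```rust
/// Amortized bytes of unused slot space charged to each live key.
/// Simplified model: the table pays for `keys / load_factor` slots of
/// `slot_bytes` each, but only `keys` of them hold data.
fn slack_per_key(slot_bytes: f64, load_factor: f64) -> f64 {
    slot_bytes * (1.0 / load_factor - 1.0)
}

fn main() {
    // 160 is the per-slot footprint quoted above; 16 is a hypothetical
    // near-pointer-sized slot used only for comparison.
    for &slot in [160.0_f64, 16.0].iter() {
        // 0.875 (7/8) is the maximum load; the lower values are illustrative
        // load factors for a table that has recently grown.
        for &load in [0.875_f64, 0.6, 0.45].iter() {
            println!(
                "slot={:>5.0} B  load={:.3}  slack={:>6.1} B/key",
                slot,
                load,
                slack_per_key(slot, load)
            );
        }
    }
}
```

In this model slack is linear in slot size at any fixed load factor, which is the sense in which a fatter slot makes the slack proportionally worse.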

Together these dominate the overhead: of ~527 bytes/key for a 128-byte value,
roughly 400 bytes is metadata + slack, versus roughly 120 bytes for redis.
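
The overhead figures follow from subtracting the value payload from each bytes-per-key measurement. A minimal sketch of that arithmetic, with `overhead_bytes` as a hypothetical helper; it subtracts only the 128-byte value, so key bytes are counted in the remainder:

```rust
/// Bytes per key that remain after subtracting the value payload.
fn overhead_bytes(bytes_per_key: u32, value_bytes: u32) -> u32 {
    bytes_per_key - value_bytes
}

fn main() {
    // Value size from the run parameters; bytes-per-key from the results table.
    const VALUE_BYTES: u32 = 128;
    println!("IronCache overhead: {} B/key", overhead_bytes(527, VALUE_BYTES));
    println!("redis 7.2.1 overhead: {} B/key", overhead_bytes(245, VALUE_BYTES));
}
```

This gives 399 and 117 bytes/key, which the text rounds to roughly 400 and 120.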

## Optimization scope (prioritized; each its own effort)

These are scoped, NOT executed here: each touches the frozen Store waist or the
index and so needs its own design, PR, and review, targeted against the real
pinned-Linux-vs-valkey numbers. All are now protected by the A5 perf-gate, which
will catch any throughput regression an optimization introduces.

- **L1 (highest impact): shrink the per-slot footprint.** Box the large `KvObj`
variants so the table slot holds a small (near pointer-sized) value, slashing
both the slot size AND the amortized table slack per key. Expected to bring
bytes-per-key toward the Redis/Valkey range. Risk: a pointer indirection on the
read hot path; must be benchmarked against the throughput gate before it lands.
This is the single biggest lever and the recommended first optimization PR.
- **L2: a more compact index.** The DragonflyDB-style Dashtable the README cites
(extendible hashing, far less per-entry metadata than a Swiss table at high
load) would cut the table slack structurally. Larger; later; its own design.
- **L3: load-factor / sizing tuning.** Cheaper than L1/L2 but bounded upside;
only worth it after L1 since L1 changes the slot size the slack is computed on.
- **Throughput: confirm before optimizing.** The indicative parity is likely an
unpinned-co-resident artifact. Run the pinned Linux head-to-head first; if a
real per-core gap appears, moving the data path from tokio/epoll to io_uring
(issue #28) is the lever. Do not optimize throughput speculatively while the
measurement says parity.
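
The boxing idea behind the first lever can be illustrated with `std::mem::size_of`. This is a sketch only: the real `KvObj` variants are not shown in this document, so `FatObj` and `SlimObj` below are hypothetical stand-ins, and the 128-byte inline payload is an arbitrary illustrative size.

```rust
use std::mem::size_of;

/// Hypothetical fat value: the enum is as large as its biggest inline
/// variant, so every table slot pays for the 128-byte array even when it
/// holds an integer.
#[allow(dead_code)]
enum FatObj {
    Int(i64),
    Inline([u8; 128]),
}

/// Hypothetical slim value: the large payload lives behind a Box, so the
/// slot holds only a discriminant plus a pointer-sized field.
#[allow(dead_code)]
enum SlimObj {
    Int(i64),
    Boxed(Box<[u8; 128]>),
}

fn main() {
    println!("FatObj slot:  {} bytes", size_of::<FatObj>());
    println!("SlimObj slot: {} bytes", size_of::<SlimObj>());
}
```

The trade-off is the one named above: reading a boxed variant costs an extra pointer dereference, and each boxed payload becomes a separate heap allocation with its own allocator overhead, so the net saving has to be measured rather than assumed.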

## Next step

Run `scripts/bench/headtohead.sh` on a pinned Linux runner against valkey-server
9.1.0 for the authoritative verdict, then execute L1 as the first optimization PR
under the A5 perf-gate. The measurement infrastructure (A1 to A5) is complete and
makes that work measurable and regression-safe.
6 changes: 6 additions & 0 deletions scripts/bench/README.md
@@ -2,6 +2,12 @@

# IronCache benchmark run script

> Where IronCache stands today, and the prioritized optimization scope, are in
> [docs/bench/FINDINGS.md](../../docs/bench/FINDINGS.md) (A6). Headline: memory is
> the real gap (~2.1x heavier per key than redis in an indicative run, driven by a
> fat per-slot value + hash-table slack); throughput needs a pinned-Linux run vs
> valkey to judge.

`scripts/bench/run.sh` is the one scripted invocation that reproduces a published
benchmark run end to end (BENCHMARK.md #8, PR-A3 of the performance track). It builds
the release binaries, boots a real IronCache server, warms the hot keyset, runs three