perf(store): shrink the per-key object - box ValueRepr variants (mem rounds 1-2)#262

Merged: ELares merged 3 commits into main from perf/mem-round-1 on Jun 16, 2026

@ELares (Owner) commented Jun 16, 2026

First two rounds of the campaign to beat redis 8.8.0 on memory (and not lose speed). Pure layout shrinks of the per-key object, zero behavior change.

  • Round 1: box the 4 collection ValueRepr variants (List/Hash/Set/ZSet -> Box). ValueRepr 72->48, KvObj 112->88, slot 128->104. SSO preserved.
  • Round 2: box the embstr inline buffer (Inline(InlineBuf) -> Inline(Box<[u8]>)). ValueRepr 48->24, KvObj 88->64, slot 104->80. Allocation-parity with redis.

Measured head-to-head against redis 8.8.0 (300k keys, 128B values): bytes-per-key 526.7 -> 386.85 (gap 2.41x -> 1.77x) and qps 71.4k -> 77.9k (+9%; the smaller slot improves table cache density). Tail latency is still a big win. Whole-workspace tests green; #![forbid(unsafe_code)] intact.
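
As a quick cross-check of the figures above, the sketch below recomputes the percentage reductions and the redis bytes-per-key that each reported gap ratio implies. The helper names `pct_change` and `implied_baseline` are illustrative and not part of the repository; the inputs are the numbers quoted in this PR and its commit messages.

```rust
/// Relative change from `before` to `after`, in percent (negative = reduction).
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Baseline bytes-per-key implied by our figure and the reported gap ratio.
fn implied_baseline(ours: f64, gap: f64) -> f64 {
    ours / gap
}

fn main() {
    // bytes-per-key at 128B values: before, after round 1, after round 2.
    let (start, round1, round2) = (526.7, 421.86, 386.85);
    println!("round 1:    {:.1}%", pct_change(start, round1));
    println!("rounds 1-2: {:.1}%", pct_change(start, round2));

    // Each (ours, gap) pair should imply roughly the same redis figure.
    for (ours, gap) in [(start, 2.41), (round1, 1.93), (round2, 1.77)] {
        println!("implied redis bytes-per-key: {:.1}", implied_baseline(ours, gap));
    }
}
```

All three gap ratios imply a redis figure of roughly 218.5 bytes per key, so the reported ratios are mutually consistent.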

The perf-gate (A5) runs on this PR and should show the improvement (bytes-per-key fell). docs/bench/OPTIMIZATION_LOG.md is the running tally and records the validated next lever: a single-allocation blob entry in a key-dedup hashbrown::HashTable (the small-value gap, 2.88x at 32B, is structural and needs that).

🤖 Generated with Claude Code

ELares and others added 3 commits June 16, 2026 00:28
…ey slot (round 1)

Memory optimization toward beating redis 8.8.0. ValueRepr was 72 bytes, sized for
its largest variants (InlineBuf 45 and ZSetVal 64), so every string/int key
reserved ~56 B it never used. Box the four collection variants
(List/Hash/Set/ZSet -> Box<...>), keeping Int/Inline/Raw unboxed so the embstr
SSO and the string/int hot path are untouched.

Measured (sizeof): ValueRepr 72->48, KvObj 112->88, table slot 128->104.
Measured (head-to-head vs redis 8.8.0, 300k keys, 128B values): bytes-per-key
526.7 -> 421.86 (-20%; gap 2.41x -> 1.93x), and qps 71.4k -> 77.9k (+9%, the
smaller slot improves table cache density). Zero behavior change; whole-workspace
tests green. See docs/bench/OPTIMIZATION_LOG.md round 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zeke <ezequiel.lares@outlook.com>
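
The round 1 change relies on a Rust enum being at least as large as its largest variant, so boxing the big variants shrinks every value, including ints and short strings. A minimal sketch of that effect follows. The types here (`ZSetVal`, `Unboxed`, `CollectionsBoxed`) are hypothetical stand-ins, not the real IronCache definitions, so the printed sizes will not necessarily match the 72 -> 48 reported above.

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::mem::size_of;

// Hypothetical stand-ins: the real IronCache types are not shown in this PR.
type Bytes = Vec<u8>;

#[allow(dead_code)]
struct ZSetVal {
    by_member: HashMap<Bytes, f64>,
    by_score: Vec<(f64, Bytes)>,
}

/// Every variant stored inline: the enum is as large as its largest variant.
#[allow(dead_code)]
enum Unboxed {
    Int(i64),
    Inline([u8; 45]),
    Raw(Bytes),
    List(VecDeque<Bytes>),
    Hash(HashMap<Bytes, Bytes>),
    Set(HashSet<Bytes>),
    ZSet(ZSetVal),
}

/// Collections moved behind a pointer; Int/Inline/Raw stay inline,
/// so the string/int path needs no extra indirection.
#[allow(dead_code)]
enum CollectionsBoxed {
    Int(i64),
    Inline([u8; 45]),
    Raw(Bytes),
    List(Box<VecDeque<Bytes>>),
    Hash(Box<HashMap<Bytes, Bytes>>),
    Set(Box<HashSet<Bytes>>),
    ZSet(Box<ZSetVal>),
}

fn main() {
    println!("unboxed:           {} B", size_of::<Unboxed>());
    println!("collections boxed: {} B", size_of::<CollectionsBoxed>());
    println!("Box<ZSetVal>:      {} B", size_of::<Box<ZSetVal>>());
}
```

The trade is one extra heap allocation and pointer hop for collection values, in exchange for a smaller slot for every key.
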
…ound 2)

Round 2 toward beating redis 8.8.0. Inline(InlineBuf) (a 45 B in-object SSO
buffer) -> Inline(Box<[u8]>), dropping ValueRepr 48->24, KvObj 88->64, table slot
104->80. Allocation-parity with redis (which also heap-allocates the object).

Measured (head-to-head vs redis 8.8.0, 300k keys): 128B values bytes-per-key
421.86 -> 386.85 (gap 1.93x -> 1.77x), qps steady ~77.6k. Table slack per key
146.8 -> 125.8. Zero behavior change; whole-workspace tests green. InlineBuf
removed.

Logged the key structural finding: the small-value gap (32B: 291 vs 101 = 2.88x)
is dominated by IronCache's ~3 allocations per key + key duplication, which safe
field-shrinks cannot close. The next lever is a single-allocation blob entry in a
key-dedup table (see docs/bench/OPTIMIZATION_LOG.md). Round 2 keeps the safe wins
banked while that larger rewrite is scoped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zeke <ezequiel.lares@outlook.com>
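
Round 2 applies the same idea to the short-string path: the in-object buffer becomes an exact-length `Box<[u8]>`, so the enum no longer has to reserve space for the longest inline string. The sketch below is an illustration under assumptions: `Round1Str`, `Round2Str`, `encode_str`, and the `INLINE_MAX` cutoff are hypothetical names and values, not taken from the repository.

```rust
use std::mem::size_of;

// Hypothetical cutoff between the short-string and raw-string encodings.
const INLINE_MAX: usize = 44;

/// Before: short strings live in a fixed in-object buffer (length + bytes).
#[allow(dead_code)]
enum Round1Str {
    Int(i64),
    Inline { len: u8, buf: [u8; INLINE_MAX] },
    Raw(Vec<u8>),
}

/// After: short strings live in one exact-length heap allocation.
#[allow(dead_code)]
enum Round2Str {
    Int(i64),
    Inline(Box<[u8]>),
    Raw(Vec<u8>),
}

/// Pick the string encoding by length; integer detection is omitted here.
fn encode_str(bytes: &[u8]) -> Round2Str {
    if bytes.len() <= INLINE_MAX {
        // Box<[u8]> carries pointer + length only, with no spare capacity.
        Round2Str::Inline(Box::from(bytes))
    } else {
        Round2Str::Raw(bytes.to_vec())
    }
}

fn main() {
    println!("in-object buffer: {} B", size_of::<Round1Str>());
    println!("boxed slice:      {} B", size_of::<Round2Str>());
    let short = encode_str(b"hello");
    println!("short is inline: {}", matches!(short, Round2Str::Inline(_)));
}
```

The cost is that short strings now take a heap allocation of their own, which is the allocation parity with redis that the commit message describes.
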
… (round 3 plan)

Research (redis 8.2 kvobj, valkey 8.0/8.1, Dragonfly Dashtable, hashbrown
HashTable, SwissTable/Dash/MemC3/F14) confirms the lever for the small-value gap
and a SAFE Rust path: hashbrown::HashTable<Entry> with key-from-blob hash/eq
closures (no key duplication) + a thin-pointer single-allocation entry
([header|key|value]). Logged in docs/bench/OPTIMIZATION_LOG.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zeke <ezequiel.lares@outlook.com>
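
To make the round 3 idea concrete, here is a rough sketch of a single-allocation `[header|key|value]` entry whose key is hashed and compared straight out of the blob, so the table never stores a second copy of the key. This is not the planned implementation: it substitutes the standard library's `HashSet` with a `Borrow<[u8]>` impl for `hashbrown::HashTable` with closures, and it uses a fat `Box<[u8]>` rather than a thin pointer. The `Entry` type and its 4-byte header layout are assumptions made for illustration.

```rust
use std::borrow::Borrow;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

const HEADER: usize = 4; // hypothetical header: key length as u32, little-endian

/// One heap allocation laid out as [key_len | key | value].
struct Entry(Box<[u8]>);

impl Entry {
    fn new(key: &[u8], value: &[u8]) -> Self {
        let klen = u32::try_from(key.len()).expect("key too long for u32 header");
        let mut blob = Vec::with_capacity(HEADER + key.len() + value.len());
        blob.extend_from_slice(&klen.to_le_bytes());
        blob.extend_from_slice(key);
        blob.extend_from_slice(value);
        Entry(blob.into_boxed_slice())
    }

    fn key_len(&self) -> usize {
        u32::from_le_bytes([self.0[0], self.0[1], self.0[2], self.0[3]]) as usize
    }

    fn key(&self) -> &[u8] {
        &self.0[HEADER..HEADER + self.key_len()]
    }

    fn value(&self) -> &[u8] {
        &self.0[HEADER + self.key_len()..]
    }
}

// Hash and equality look only at the key bytes inside the blob, and agree
// with the impls for [u8], so lookups can be done with a plain &[u8].
impl Hash for Entry {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.key().hash(state)
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.key() == other.key()
    }
}
impl Eq for Entry {}
impl Borrow<[u8]> for Entry {
    fn borrow(&self) -> &[u8] {
        self.key()
    }
}

fn main() {
    let mut table: HashSet<Entry> = HashSet::new();
    table.replace(Entry::new(b"user:1", b"alice"));
    table.replace(Entry::new(b"user:1", b"bob")); // same key: replaces the entry
    let hit = table.get(b"user:1".as_slice()).expect("key present");
    println!("{}", String::from_utf8_lossy(hit.value()));
}
```

Each key costs one allocation here, and the whole sketch stays in safe Rust, which is the property the research note calls out.
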
@github-actions

perf-gate (A5)

Same-runner ratchet of HEAD against the merge-base (both rebuilt and measured in this job).
PASS = within the noise band, WARN = a real move inside budget (does not fail), FAIL = past budget in the bad direction.

| metric | base | head | delta% | band | budget | verdict |
|---|---|---|---|---|---|---|
| qps_median (peak) | 71103.54 | 70732.66 | -0.52% | +/-5.09% | drop <= 15% | PASS |
| bytes_per_key int | 239.99 | 156.11 | -34.95% | det | rise <= 5% | PASS |
| bytes_per_key embstr | 240.09 | 172.21 | -28.27% | det | rise <= 5% | PASS |
| bytes_per_key raw | 496.10 | 412.22 | -16.91% | det | rise <= 5% | PASS |

Overall: PASS

  • qps: noisy on shared CI, so the band comes from the base reps spread (floored at 5%); a drop is only a regression past the 15% budget.
  • bytes_per_key: deterministic (allocator-true memmodel), so a tight 5% rise budget; any rise beyond it FAILs.
  • Open-loop tails / criterion micro-benches are reported-not-failed (tail noise is high) and are not part of this ratchet.
  • An intentional perf trade is landed by raising the relevant budget in this PR with a documented reason (CI never auto-commits a baseline).
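
The verdict rules above can be sketched as a small classifier. This is a reconstruction from the description, not the gate's actual code: the names `classify`, `delta_pct`, `qps_band`, and `BadDirection` are hypothetical, and two points are assumptions, namely that a move in the good direction always passes (consistent with the table, where large bytes-per-key drops show PASS) and that the deterministic band is treated as zero.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Pass,
    Warn,
    Fail,
}

/// Which direction counts as a regression for a metric.
#[derive(Clone, Copy)]
enum BadDirection {
    Drop, // e.g. qps: lower is worse
    Rise, // e.g. bytes_per_key: higher is worse
}

/// Percentage change of `head` relative to `base`.
fn delta_pct(base: f64, head: f64) -> f64 {
    (head - base) / base * 100.0
}

/// Noise band for qps: the spread of the base reps, floored at 5%.
fn qps_band(base_spread_pct: f64) -> f64 {
    base_spread_pct.max(5.0)
}

/// `band_pct` and `budget_pct` are non-negative percentages.
fn classify(delta_pct: f64, band_pct: f64, budget_pct: f64, bad: BadDirection) -> Verdict {
    // Size of the move in the bad direction; improvements count as zero (assumption).
    let regression = match bad {
        BadDirection::Drop => (-delta_pct).max(0.0),
        BadDirection::Rise => delta_pct.max(0.0),
    };
    if regression > budget_pct {
        Verdict::Fail
    } else if regression > band_pct {
        Verdict::Warn
    } else {
        Verdict::Pass
    }
}

fn main() {
    // Rows from the table above; the deterministic band is taken as 0.0.
    let qps = classify(delta_pct(71103.54, 70732.66), 5.09, 15.0, BadDirection::Drop);
    let int = classify(delta_pct(239.99, 156.11), 0.0, 5.0, BadDirection::Rise);
    println!("qps: {:?}, bytes_per_key int: {:?}", qps, int);
}
```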

@ELares ELares merged commit 0ea8f9b into main Jun 16, 2026
12 checks passed
@ELares ELares deleted the perf/mem-round-1 branch June 16, 2026 07:59