perf(store): 8-byte tagged-pointer Entry slot - first CLEAR memory win vs redis 8.8.0 (round 5) #265

Merged: ELares merged 1 commit into main from perf/tagged-slot on Jun 16, 2026
@ELares (Owner) commented on Jun 16, 2026

What

Shrink the per-shard hashbrown::HashTable<Entry> slot from 16 bytes to 8 by turning Entry from a 16-byte enum (Str(Box<[u8]>) fat pointer | Coll(Box<CollEntry>)) into a single 8-byte NonNull<u8> tagged pointer:

  • low bit 0 = a manually-allocated Str thin blob [u32 total_len][round-3 blob] (the length moves into the allocation, leaving a one-word pointer), align 8;
  • low bit 1 = Box::into_raw(Box<CollEntry>).

Both allocations are >= 2-aligned, so the low tag bit is always free. This keeps the Round-3 single-allocation-per-key + no-key-duplication win and adds the slot shrink, roughly halving table_bytes_per_key.
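
A minimal sketch of the low-bit tagging described above, assuming Rust 1.84+ for the stable strict-provenance `map_addr`/`addr` methods. `Slot`, `Coll`, `View`, and the `Box<u64>` stand-in for the Str blob are hypothetical names for illustration only; this is not the `Entry` code in kvobj.rs.

```rust
use std::ptr::NonNull;

/// Hypothetical stand-in for the PR's `CollEntry` (8-aligned because of the Vec).
struct Coll {
    items: Vec<u64>,
}

/// Low bit 0 = string payload, low bit 1 = collection.
const TAG_COLL: usize = 1;

/// One-word slot holding either kind of allocation.
struct Slot(NonNull<u8>);

enum View<'a> {
    Str(&'a u64),
    Coll(&'a Coll),
}

impl Slot {
    fn from_str_word(v: Box<u64>) -> Slot {
        // Box<u64> is 8-aligned, so the low bit is already 0 (the Str tag).
        let raw = Box::into_raw(v).cast::<u8>();
        // SAFETY: Box::into_raw never returns null.
        Slot(unsafe { NonNull::new_unchecked(raw) })
    }

    fn from_coll(c: Box<Coll>) -> Slot {
        // map_addr changes the address but keeps the pointer's provenance.
        let raw = Box::into_raw(c).cast::<u8>().map_addr(|a| a | TAG_COLL);
        // SAFETY: setting a bit on a non-null address cannot produce null.
        Slot(unsafe { NonNull::new_unchecked(raw) })
    }

    fn is_coll(&self) -> bool {
        self.0.as_ptr().addr() & TAG_COLL == TAG_COLL
    }

    /// Clears the tag bit, recovering the original allocation address.
    fn untagged(&self) -> *mut u8 {
        self.0.as_ptr().map_addr(|a| a & !TAG_COLL)
    }

    fn view(&self) -> View<'_> {
        if self.is_coll() {
            // SAFETY: tag 1 is only set by from_coll, so this is a live Coll.
            View::Coll(unsafe { &*self.untagged().cast::<Coll>() })
        } else {
            // SAFETY: tag 0 is only produced by from_str_word, so this is a live u64.
            View::Str(unsafe { &*self.untagged().cast::<u64>() })
        }
    }
}

impl Drop for Slot {
    fn drop(&mut self) {
        // SAFETY: rebuilds exactly the Box that the matching constructor leaked.
        unsafe {
            if self.is_coll() {
                drop(Box::from_raw(self.untagged().cast::<Coll>()));
            } else {
                drop(Box::from_raw(self.untagged().cast::<u64>()));
            }
        }
    }
}

fn main() {
    let s = Slot::from_str_word(Box::new(7));
    let c = Slot::from_coll(Box::new(Coll { items: vec![1, 2, 3] }));
    // The slot is a single machine word regardless of which kind it holds.
    assert_eq!(std::mem::size_of::<Slot>(), std::mem::size_of::<usize>());
    for slot in [&s, &c] {
        match slot.view() {
            View::Str(word) => println!("str payload word: {word}"),
            View::Coll(coll) => println!("collection with {} items", coll.items.len()),
        }
    }
}
```

The sketch uses `Box` for both variants to stay short; the PR instead allocates the Str blob manually so the length can live inside the allocation.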

This lifts #![forbid(unsafe_code)] on ironcache-store (authorized) in favor of #![deny(unsafe_op_in_unsafe_fn)]. All unsafe is confined to one heavily documented Entry impl in kvobj.rs (manual alloc/dealloc, strict-provenance tag set/clear via map_addr, the access reconstructions, Drop, Clone); every block carries a // SAFETY: justification. The blob content is still parsed with safe bounds-checked slicing. The Store waist is unchanged; only the rmw type-dispatch in lib.rs swaps the old enum match for obj.as_coll_val_mut() now that Entry is opaque.

Result (the optimization-campaign goal: beat redis 8.8.0 on memory)

| metric | before (round 3) | after (round 5) | redis 8.8.0 |
| --- | --- | --- | --- |
| bytes/key (h2h, 128B, 300k, macOS) | 221.5 (parity) | 199.69 | 218.61 |
| ratio vs redis | 1.01x | 0.91x (CLEAR WIN) | - |
| memmodel table_bytes_per_key | 26.2 | 13.11 | - |
| size_of::<Entry>() | 16 | 8 | - |

This is the first clear memory win over redis 8.8.0 (round 3 was at parity). Memory bytes-per-key is the reliable metric on any box; the macOS throughput number remains contention-bound and non-authoritative (a clean speed verdict needs pinned Linux).

Soundness (3-lens adversarial review: UB / aliasing / behavioral parity)

Aliasing = SOUND, parity = PRESERVED. The UB lens found two issues that miri's executed paths could not catch; this PR fixes both:

  • CRITICAL (fixed): the new u32 total-length prefix would truncate for a single value > 4 GiB, so Drop would dealloc with a wrong Layout (UB). Reachable because APPEND grew a value unbounded (no proto-max-bulk-len check). This was a regression the prefix introduced (round 3's Box<[u8]> carried a usize length). Fix: (1) a hard expect in alloc_str_blob so the truncation branch is gone (a controlled panic backstop, not UB); (2) cap APPEND at 512 MB returning the exact Redis error (new ErrorReply::string_exceeds_max), matching Redis checkStringLength and keeping every value < 4 GiB so the backstop is unreachable in practice.
  • HIGH (fixed): the tag scheme needs the Str blob and Box<CollEntry> >= 2-aligned, guarded only by a release-stripped debug_assert. Added two const assertions (STR_ALIGN >= 2, align_of::<CollEntry>() >= 2) so a future alignment-breaking edit fails the build.
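
The two fixes above can be illustrated with a small sketch of a length-prefixed blob, assuming the `[u32 total_len][payload]` layout and 8-byte alignment from the description. The function bodies, `PREFIX`, `MAX_STRING_LEN`, `AppendError`, and `check_append_len` are hypothetical; only the names `alloc_str_blob` and `STR_ALIGN` come from the PR text, and the real implementations may differ.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

const STR_ALIGN: usize = 8;
const PREFIX: usize = std::mem::size_of::<u32>();
/// Assumed cap: 512 MB, the limit the PR applies to APPEND.
const MAX_STRING_LEN: usize = 512 * 1024 * 1024;

// Build-time guard: the low tag bit is only free if the blob is >= 2-aligned.
const _: () = assert!(STR_ALIGN >= 2);

fn alloc_str_blob(payload: &[u8]) -> NonNull<u8> {
    let total = PREFIX + payload.len();
    // Controlled panic instead of a silently truncated length prefix.
    let total_u32 = u32::try_from(total).expect("str blob exceeds u32 length prefix");
    let layout = Layout::from_size_align(total, STR_ALIGN).expect("valid layout");
    // SAFETY: layout has non-zero size because total >= PREFIX.
    let raw = unsafe { alloc(layout) };
    let Some(ptr) = NonNull::new(raw) else {
        handle_alloc_error(layout)
    };
    // SAFETY: ptr is valid for `total` bytes and 8-aligned, so the u32 write
    // at offset 0 is in bounds and aligned; the payload copy fits after it.
    unsafe {
        ptr.as_ptr().cast::<u32>().write(total_u32);
        std::ptr::copy_nonoverlapping(payload.as_ptr(), ptr.as_ptr().add(PREFIX), payload.len());
    }
    ptr
}

/// Safety: `ptr` must come from `alloc_str_blob` and not have been freed.
unsafe fn blob_payload<'a>(ptr: NonNull<u8>) -> &'a [u8] {
    // SAFETY: the prefix holds the exact allocation size written at alloc time.
    unsafe {
        let total = ptr.as_ptr().cast::<u32>().read() as usize;
        std::slice::from_raw_parts(ptr.as_ptr().add(PREFIX), total - PREFIX)
    }
}

/// Safety: `ptr` must come from `alloc_str_blob` and not have been freed.
unsafe fn free_str_blob(ptr: NonNull<u8>) {
    // SAFETY: the Layout is rebuilt from the stored prefix, so it matches the
    // one used by alloc. A truncated prefix here is what would have been UB.
    unsafe {
        let total = ptr.as_ptr().cast::<u32>().read() as usize;
        dealloc(ptr.as_ptr(), Layout::from_size_align_unchecked(total, STR_ALIGN));
    }
}

#[derive(Debug, PartialEq)]
enum AppendError {
    ExceedsMax,
}

/// Rejects an APPEND whose result would pass the cap, keeping values far below 4 GiB.
fn check_append_len(current: usize, extra: usize) -> Result<usize, AppendError> {
    match current.checked_add(extra) {
        Some(n) if n <= MAX_STRING_LEN => Ok(n),
        _ => Err(AppendError::ExceedsMax),
    }
}

fn main() {
    let blob = alloc_str_blob(b"hello");
    // SAFETY: blob was just allocated by alloc_str_blob and is freed exactly once.
    unsafe {
        assert_eq!(blob_payload(blob), &b"hello"[..]);
        free_str_blob(blob);
    }
    assert_eq!(check_append_len(MAX_STRING_LEN, 1), Err(AppendError::ExceedsMax));
}
```

Because `free_str_blob` trusts the stored prefix to rebuild the `Layout`, the checked `u32::try_from` and the APPEND cap together keep that prefix equal to the true allocation size.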

Gates

  • 849 tests green; clippy -D warnings, fmt, invariant-lint clean.
  • miri under -Zmiri-strict-provenance clean across the store lib (incl. 8 dedicated Entry unsafe-path tests) AND every integration test (primitives, keyspace, eviction, the four collection in-place suites, watch). The jemalloc accounting test is miri-ignored (FFI not miri-executable; documented, non-UB).
  • The A5 per-PR perf-gate (bytes-per-key + qps ratchet) runs on this PR.

🤖 Generated with Claude Code

…n vs redis 8.8.0 (round 5)

Shrink the per-shard `hashbrown::HashTable<Entry>` slot from 16 bytes to 8 by
turning `Entry` from a 16-byte enum (`Str(Box<[u8]>)` fat pointer | `Coll(Box<
CollEntry>)`) into a single 8-byte `NonNull<u8>` TAGGED POINTER:

- low bit 0 = a manually-allocated Str THIN blob `[u32 total_len][round-3 blob]`
  (the length moves INTO the allocation, so the pointer is one word, not a fat
  ptr+len), align 8;
- low bit 1 = `Box::into_raw(Box<CollEntry>)`.

Both allocations are >= 2-aligned, so the low tag bit is always free. This keeps
the Round-3 single-allocation-per-key + no-key-duplication win and adds the slot
shrink, roughly halving `table_bytes_per_key`.

This LIFTS `#![forbid(unsafe_code)]` on ironcache-store (authorized) in favor of
`#![deny(unsafe_op_in_unsafe_fn)]`. All unsafe is CONFINED to one heavily
documented `Entry` impl in kvobj.rs (manual alloc/dealloc, strict-provenance tag
set/clear via `map_addr`, the access reconstructions, Drop, Clone); every unsafe
block carries a `// SAFETY:` justification. The blob CONTENT is still parsed with
SAFE bounds-checked slicing. The Store waist (ValueRef/RmwEntry/side-traits) is
UNCHANGED; only the rmw type-dispatch in lib.rs swaps the old enum match for
`obj.as_coll_val_mut()` now that `Entry` is opaque.

Result (the optimization-campaign goal: beat redis 8.8.0 on memory):
- head-to-head (128B, 300k keys, macOS): bytes/key 221.5 -> 199.69 vs redis
  218.61 = 0.91x, CLEARLY BELOW redis (Round 3 was parity at 221.5).
- memmodel (allocator-true): table_bytes_per_key 26.2 -> 13.11; size_of::<Entry>
  16 -> 8.

Soundness hardening from a 3-lens adversarial review (UB / aliasing / behavioral
parity). Aliasing = SOUND, parity = PRESERVED; the UB lens found and this fixes
two issues miri's executed paths could not catch:
- CRITICAL: the new u32 total-length prefix would TRUNCATE for a single value
  > 4 GiB, so Drop would dealloc with a wrong Layout (UB). It was reachable
  because APPEND grew a value unbounded (no proto-max-bulk-len check). This was a
  regression the prefix introduced (round 3's Box<[u8]> carried a usize length).
  Fix: (1) a hard `expect` in alloc_str_blob so the truncation branch is gone (a
  controlled panic backstop, not UB); (2) cap APPEND at 512 MB returning the exact
  Redis error (new ErrorReply::string_exceeds_max), matching Redis checkStringLength
  AND keeping every value < 4 GiB so the backstop is unreachable in practice.
- HIGH: the tag scheme needs the Str blob and Box<CollEntry> >= 2-aligned, guarded
  only by a release-stripped debug_assert. Added two `const` assertions
  (STR_ALIGN >= 2, align_of::<CollEntry>() >= 2) so a future alignment-breaking
  edit fails the BUILD instead of silently corrupting the tag.

Gates: 849 tests green; clippy -D warnings, fmt, invariant-lint clean; miri under
-Zmiri-strict-provenance clean across the store lib (incl. 8 dedicated Entry
unsafe-path tests) AND every integration test. The jemalloc accounting test is
miri-ignored (FFI not miri-executable; documented, non-UB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zeke <ezequiel.lares@outlook.com>
@github-actions

perf-gate (A5)

Same-runner ratchet of HEAD against the merge-base (both rebuilt and measured in this job).
PASS = within the noise band, WARN = a real move inside budget (does not fail), FAIL = past budget in the bad direction.

| metric | base | head | delta% | band | budget | verdict |
| --- | --- | --- | --- | --- | --- | --- |
| qps_median (peak) | 70452.15 | 70895.86 | 0.63% | +/-5.29% | drop <= 15% | PASS |
| bytes_per_key int | 58.11 | 45.02 | -22.53% | det | rise <= 5% | PASS |
| bytes_per_key embstr | 58.17 | 61.07 | 4.99% | det | rise <= 5% | WARN |
| bytes_per_key raw | 346.28 | 333.15 | -3.79% | det | rise <= 5% | PASS |

Overall: WARN

  • qps: noisy on shared CI, so the band comes from the base reps spread (floored at 5%); a drop is only a regression past the 15% budget.
  • bytes_per_key: deterministic (allocator-true memmodel), so a tight 5% rise budget; any rise beyond it FAILs.
  • Open-loop tails / criterion micro-benches are reported-not-failed (tail noise is high) and are not part of this ratchet.
  • An intentional perf trade is landed by raising the relevant budget in this PR with a documented reason (CI never auto-commits a baseline).
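
The verdict rule can be sketched as follows. This is an illustration inferred from the legend and the table rows, not the gate's actual code: `Verdict`, `Bad`, `delta_pct`, and `verdict` are hypothetical names, a deterministic ("det") metric is modeled as a zero-width band, and treating any move in the good direction as PASS is an assumption based on the `-22.53%` row.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Warn,
    Fail,
}

/// Which direction of change counts as a regression for a metric.
#[derive(Clone, Copy)]
enum Bad {
    Drop, // e.g. qps: lower is worse
    Rise, // e.g. bytes_per_key: higher is worse
}

/// Percentage change of head relative to base.
fn delta_pct(base: f64, head: f64) -> f64 {
    (head - base) / base * 100.0
}

fn verdict(delta_pct: f64, band_pct: f64, budget_pct: f64, bad: Bad) -> Verdict {
    // Size of the move in the bad direction; negative means the metric improved.
    let bad_move = match bad {
        Bad::Drop => -delta_pct,
        Bad::Rise => delta_pct,
    };
    if bad_move <= band_pct {
        Verdict::Pass // noise, or an improvement
    } else if bad_move <= budget_pct {
        Verdict::Warn // a real move, still inside budget
    } else {
        Verdict::Fail // past budget in the bad direction
    }
}

fn main() {
    // Rows from the table above: (name, base, head, band, budget, bad direction).
    let rows = [
        ("qps_median (peak)", 70452.15, 70895.86, 5.29, 15.0, Bad::Drop),
        ("bytes_per_key int", 58.11, 45.02, 0.0, 5.0, Bad::Rise),
        ("bytes_per_key embstr", 58.17, 61.07, 0.0, 5.0, Bad::Rise),
        ("bytes_per_key raw", 346.28, 333.15, 0.0, 5.0, Bad::Rise),
    ];
    for (name, base, head, band, budget, bad) in rows {
        let d = delta_pct(base, head);
        println!("{name}: {d:.2}% -> {:?}", verdict(d, band, budget, bad));
    }
}
```

Under these assumptions the four rows reproduce the verdicts in the table, with embstr as the only WARN.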

@ELares merged commit 600403d into main on Jun 16, 2026
12 checks passed
@ELares deleted the perf/tagged-slot branch on June 16, 2026 13:04