Skip to content

feat(kvflash): snapshot ledger + QK-pool restore-seed + consume-restored-KV#446

Open
dusterbloom wants to merge 11 commits into
Luce-Org:mainfrom
dusterbloom:pr/kvflash-ledger-consume
Open

feat(kvflash): snapshot ledger + QK-pool restore-seed + consume-restored-KV#446
dusterbloom wants to merge 11 commits into
Luce-Org:mainfrom
dusterbloom:pr/kvflash-ledger-consume

Conversation

@dusterbloom

@dusterbloom dusterbloom commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Wave-1 KVFlash deep-context — the warm-prefill payoff.

New commits: snapshot ledger + QK-pool restore-seed (f457afb, lib+tests); consume-restored-KV on pooled restore instead of re-prefill (cfbf8ed, env KVFLASH_RESTORE_CONSUME, default-on); drafter-history reconstruction from ledger (14b295e).

Stacks on: #445 (cache-persistence), then #430 core, #429 pager, #428 reservation. Lower commits in this diff are those predecessors; review/merge in order — this rebases to drop them as they land.

Reviewer caveat (correctness): KVFLASH_RESTORE_CONSUME=1 output diverges from the no-consume AR path between ~35K and ~60K tokens. Root cause: chunk-to-block placement permutation changes the FlashAttention key-reduction order, FP non-associativity perturbs the logits, and a near-tie greedy argmax flips. See bench/abc_cache_harness/phase3_divergence_rootcause.md. It is an accepted reuse-vs-recompute flip, NOT a correctness bug — but output is NOT bit-identical at long ctx and there is NO agentic gate yet.

The ~0.1s warm-turn speedup is synthetic NIAH single-needle, not a published result. A real-CLI multi-turn output-equivalence capture is the open follow-up before defaulting this on for any widely-read claim.

Review in cubic

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12 issues found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">

<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:199">
P2: Prefill metric extraction conflates absent prefill_s with 0.0, producing misleading speedup/output when parse_done fails.</violation>

<violation number="2" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P2: Gate result can falsely pass when consume=1 is internally non-deterministic because c1_self is not included in pass/fail conditions.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.h">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: Unused member introduces dead state and misleading API surface. Remove it until graph-cache wiring is implemented.</violation>
</file>

<file name="bench/bitplane_lsh_experiment.py">

<violation number="1" location="bench/bitplane_lsh_experiment.py:286">
P2: `k_extra == 0` incorrectly keeps all remaining tokens, invalidating the sink+recent mass-recall numbers at small k. Return an empty extra set when no extra budget is available.</violation>

<violation number="2" location="bench/bitplane_lsh_experiment.py:335">
P2: Script introduces an undeclared runtime dependency on SciPy. This makes the experiment fail in standard project environments unless SciPy is installed out-of-band.</violation>

<violation number="3" location="bench/bitplane_lsh_experiment.py:386">
P2: Default output directory is machine-specific and non-portable. Use a repo-relative default path.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:32">
P2: Incomplete list of existing vec_dot FA functions — document claims only f16, q4_0, and bf16 exist, but ggml-cuda fattn-common.cuh also has q4_1, q5_0, q5_1, and q8_0. Missing these 4 real types undermines the claim that the vec_dot set is small and makes the selective-instantiation recommendation less grounded.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:122">
P1: Wrong claim that vLLM has no sub-8-bit KV support (❌ in table). vLLM merged TurboQuant (PR #38479) on 2026-04-15 — 2+ months before this document's research date — with presets `tq-k3v4nc` (3-bit MSE keys) and `tq-t3nc` (3-bit keys + values). This invalidates the 'floor at 8-bit' narrative.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/common/moe_hybrid_ffn_eval.cpp Outdated
Comment thread server/src/common/kvflash_qk.h Outdated
|---|---|---|---|
| FP8 (E4M3/E5M2) KV | ✅ production (FA3 now runs attention *in* FP8) | ✅ production | ✅ production |
| INT8 KV | ❌ open RFC (#33480/#37319) | limited | ✅ |
| 4-bit / 3-bit / 2-bit KV | ❌ | ❌ | ❌ |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Wrong claim that vLLM has no sub-8-bit KV support (❌ in table). vLLM merged TurboQuant (PR #38479) on 2026-04-15 — 2+ months before this document's research date — with presets tq-k3v4nc (3-bit MSE keys) and tq-t3nc (3-bit keys + values). This invalidates the 'floor at 8-bit' narrative.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 122:

<comment>Wrong claim that vLLM has no sub-8-bit KV support (❌ in table). vLLM merged TurboQuant (PR #38479) on 2026-04-15 — 2+ months before this document's research date — with presets `tq-k3v4nc` (3-bit MSE keys) and `tq-t3nc` (3-bit keys + values). This invalidates the 'floor at 8-bit' narrative.</comment>

<file context>
@@ -0,0 +1,179 @@
+|---|---|---|---|
+| FP8 (E4M3/E5M2) KV | ✅ production (FA3 now runs attention *in* FP8) | ✅ production | ✅ production |
+| INT8 KV | ❌ open RFC (#33480/#37319) | limited | ✅ |
+| 4-bit / 3-bit / 2-bit KV | ❌ | ❌ | ❌ |
+
+- Industry rationale: "quantizing 16-bit KV to 4-bit or 2-bit … quality issues still remain, making it challenging for practical deployment." (SqueezeBits vLLM-vs-TRT blog.)
</file context>

Comment thread server/src/qwen35/qwen35_backend.cpp Outdated
Comment thread server/src/qwen35/qwen35_backend.cpp
parser.add_argument("--k", default="/tmp/phase0_dump/phase0_K.bin")
parser.add_argument("--q", default="/tmp/phase0_dump/phase0_Q.bin")
parser.add_argument("--out-dir",
default="/home/peppi/Dev/lucebox-hub/bench/qwen35moe_dflash/ctxsweep/")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Default output directory is machine-specific and non-portable. Use a repo-relative default path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 386:

<comment>Default output directory is machine-specific and non-portable. Use a repo-relative default path.</comment>

<file context>
@@ -0,0 +1,392 @@
+    parser.add_argument("--k", default="/tmp/phase0_dump/phase0_K.bin")
+    parser.add_argument("--q", default="/tmp/phase0_dump/phase0_Q.bin")
+    parser.add_argument("--out-dir",
+                        default="/home/peppi/Dev/lucebox-hub/bench/qwen35moe_dflash/ctxsweep/")
+    args = parser.parse_args()
+    run_experiment(args.k, args.q, args.out_dir)
</file context>

break

# Spearman rank correlation
from scipy.stats import spearmanr

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Script introduces an undeclared runtime dependency on SciPy. This makes the experiment fail in standard project environments unless SciPy is installed out-of-band.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 335:

<comment>Script introduces an undeclared runtime dependency on SciPy. This makes the experiment fail in standard project environments unless SciPy is installed out-of-band.</comment>

<file context>
@@ -0,0 +1,392 @@
+            break
+
+    # Spearman rank correlation
+    from scipy.stats import spearmanr
+    rho_1bit, _ = spearmanr(s_true, s_1bit)
+    rho_2bit, _ = spearmanr(s_true, s_2bit)
</file context>

Comment on lines +286 to +291
if k_extra > 0 and len(rem_scores) > k_extra:
rem_top_idx = np.where(remaining)[0][
np.argpartition(rem_scores, -k_extra)[-k_extra:]
]
else:
rem_top_idx = np.where(remaining)[0]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: k_extra == 0 incorrectly keeps all remaining tokens, invalidating the sink+recent mass-recall numbers at small k. Return an empty extra set when no extra budget is available.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 286:

<comment>`k_extra == 0` incorrectly keeps all remaining tokens, invalidating the sink+recent mass-recall numbers at small k. Return an empty extra set when no extra budget is available.</comment>

<file context>
@@ -0,0 +1,392 @@
+            k_extra = max(0, k_total - n_always)
+            # Top k_extra from remaining by approx
+            rem_scores = scores[remaining]
+            if k_extra > 0 and len(rem_scores) > k_extra:
+                rem_top_idx = np.where(remaining)[0][
+                    np.argpartition(rem_scores, -k_extra)[-k_extra:]
</file context>
Suggested change
if k_extra > 0 and len(rem_scores) > k_extra:
rem_top_idx = np.where(remaining)[0][
np.argpartition(rem_scores, -k_extra)[-k_extra:]
]
else:
rem_top_idx = np.where(remaining)[0]
if k_extra <= 0:
rem_top_idx = np.array([], dtype=np.int64)
elif len(rem_scores) > k_extra:
rem_top_idx = np.where(remaining)[0][
np.argpartition(rem_scores, -k_extra)[-k_extra:]
]
else:
rem_top_idx = np.where(remaining)[0]

Per-type vec_dot functions that exist today:
- `vec_dot_fattn_vec_KQ_f16` (FP16)
- `vec_dot_fattn_vec_KQ_q4_0` — "Dequantizes and dots using `dp4a`"
- `vec_dot_fattn_vec_KQ_bf16`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Incomplete list of existing vec_dot FA functions — document claims only f16, q4_0, and bf16 exist, but ggml-cuda fattn-common.cuh also has q4_1, q5_0, q5_1, and q8_0. Missing these 4 real types undermines the claim that the vec_dot set is small and makes the selective-instantiation recommendation less grounded.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 32:

<comment>Incomplete list of existing vec_dot FA functions — document claims only f16, q4_0, and bf16 exist, but ggml-cuda fattn-common.cuh also has q4_1, q5_0, q5_1, and q8_0. Missing these 4 real types undermines the claim that the vec_dot set is small and makes the selective-instantiation recommendation less grounded.</comment>

<file context>
@@ -0,0 +1,179 @@
+Per-type vec_dot functions that exist today:
+- `vec_dot_fattn_vec_KQ_f16` (FP16)
+- `vec_dot_fattn_vec_KQ_q4_0` — "Dequantizes and dots using `dp4a`"
+- `vec_dot_fattn_vec_KQ_bf16`
+
+**Why quantized KV cannot use tensor cores:** "tensor cores require dense, regularly-formatted operands. Quantized formats store multiple values packed per element with scale/zero-point metadata. The MMA hardware cannot directly operate on these compressed representations — dequantization must occur first, eliminating the memory efficiency gains." (DeepWiki).
</file context>


// Persistent pipelined state (initialized once, reused across requests)
std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Unused member introduces dead state and misleading API surface. Remove it until graph-cache wiring is implemented.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.h, line 111:

<comment>Unused member introduces dead state and misleading API surface. Remove it until graph-cache wiring is implemented.</comment>

<file context>
@@ -83,13 +96,20 @@ class Qwen35MoeBackend : public Qwen35Backend {
 
     // Persistent pipelined state (initialized once, reused across requests)
     std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
+    std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;
+    bool spec_microbench_done_ = false;
     bool ensure_pipe_state(int kv_start);
</file context>

…max_ctx

The MoE expert placement reserved KV for max_ctx (10 GiB @131072) even with
--kvflash, forcing experts cold -> the pool was pure overhead. Reserve for the
resident pool instead when the full reservation would force experts cold, so
experts stay hot at high max_ctx (decouples max_ctx from the expert-placement
cliff). A post-init gate disables KVFlash when it is redundant (full KV already
fits all experts hot), keyed on all-hot-with-full-KV so it never disables a pool
that is itself keeping experts hot.

The rule is a shared pure helper (common/kvflash_placement.h) so future MoE
backends inherit it. Unit test (5 cases, no GPU) + hardware-gated integration
test (RTX 3090: 2203 cold -> 0 cold @max_ctx 131072, decode 43->66 tok/s).
Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged
KV in logical chunk order; header-validated against layout) and a factored
for_each_segment() helper. serde uses synchronous get/set and adapts to the
pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning
(pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the
ensure_free_block + reselect protections; empty by default (byte-identical
non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard
reject, pinning, deadlock guard, reset.
…r KVFlash

Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts
larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction
automatic in alloc_span); the restore residual delta routes through the same
chunked path. Pooled snapshot save/restore serializes the pager into the prefix
snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain
skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn
128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS
critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385)
and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data.

GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool
sizes; cross-turn disk restore round-trips losslessly.
…gment

Three complexity cuts, no behavior change (GPU sink-recall gate + serde/
placement unit tests green):
- merge restore residual's identical snap_pooled/else chunk loops into one
  (the else ct ternary already subsumes the pooled case)
- extract chunked_prefill() shared by generate_impl kvf_paged + restore residual
- inline single-caller for_each_segment template into serialize

net -25 lines (54 ins / 79 del).
…restore works

At ≥128K with the KVFlash pool active, turn 1 never saved a prefix snapshot —
the pooled-prefill branch was stubbed to a diagnostic ("boundary snapshot
skipped: pooled prefill relocates chunks") and returned without saving. So turn
2 found nothing to restore (prefix_len=0), fell back to a full cold re-prefill
(0.8s→77.6s), decode regressed 80→20 tok/s, and turn 3 crashed. The all-hot
35B-A3B runs the dense Qwen35Backend path (moe_hybrid==nullptr), so this was the
live bug for the user's deep-context (>128K = 39% of real prompts) workload.

- add KvFlashPager::serialize(max_chunks) to capture only chunks [0, max_chunks)
  — the chunk-aligned turn boundary, not the whole prompt.
- add Qwen35Backend::snapshot_save_pooled_at(slot, boundary): floor the requested
  snap_pos to a chunk multiple, set cur_pos to that boundary, serialize the
  partial pager blob, and save it (the restore/deserialize path already existed
  and was correct — only the save was missing).
- replace the pooled-prefill skip stub at the chunk-aligned boundary with the
  real save; mirror the same save on the qwen35moe hybrid path.
- unit tests: floor_to_chunk + serialize(max_chunks) partial round-trip
  (bit-identical first k chunks).

131K 3-turn smoke: turn-2 restore=true prefix_len=34077 (97.5% hit), turn-3
restore=true, no crash, tool_call_valid=1.0, decode recovered 20→56-59 tok/s.
Known follow-up: warm-prefill at 131K is still ~44s (deserialize re-pages the
whole pool) — correctness/crash/decode are fixed; restoring only resident chunks
is the next optimization.
…or QK×PR372 composition

The library foundation of the snapshot×ledger unification (plan in thoughts/),
so the proven QK residency scorer composes with PR#372 across restore at ≥128K.

- Phase 1 (kvflash_pager.h): per-chunk ledger in serialize/deserialize —
  was_resident + qk_score + KV dtype enum; magic bumped KVFLASH1 (old blobs
  cleanly miss); deserialize re-pages only resident chunks; dtype-guard closes
  the latent equal-rowsize swap trap. Unit-tested (ledger round-trip).
- Phase 2 (kvflash_qk.h): rebuild/seed the QK pool from the restored ledger so
  the scorer is warm on turn N+1 instead of scoring every restored chunk as
  missing(-2.0). Unit-tested (8 new checks, restored scores != missing).
- Research/evidence: phase0_bitplane_lsh (the SimHash-on-quant-bits kill-test —
  surprise: MSB ρ=0.871 vs true QK, refutes "≈random", but modest given diffuse
  attention; sign bit carries the ranking); tbq4/tq3 fast-FA prior art.

Phase 3 (consume restored KV instead of re-prefill — VALIDATED: 36.5x warm
prefill, AR greedy bit-identity PASS, binary 0b70418a) is preserved as a patch
(/tmp/b_phase23_plus_blockerA_*.patch); its qwen35_backend.cpp integration is
interleaved with an uncommitted CUDA-graph blocker-A draft and will land after a
clean un-interleave.
…efill (opt-in)

Pooled restore consumes the deserialized KV for the chunk-aligned prefix
[0, snap_pos) and prefills only the suffix [snap_pos, prompt_len), behind
KVFLASH_RESTORE_CONSUME (default 0 = legacy re-prefill).

Validated: turn-2 prefill 36.9x faster (36.9s->1.0s) at ~35K tokens, with greedy
AR output token-for-token IDENTICAL to the re-prefill path. Completes the
snapshot x ledger x QK-pool composition (Phases 1-3).

KNOWN CEILING: above ~35K tokens the AR output DIVERGES from full re-prefill
(reused pooled KV differs from recompute once the 8192-pool evicts at scale).
Do NOT enable default-on for deep context until that divergence is root-caused
(acceptable KV-reuse near-tie flip vs real seam bug). Default-0 keeps production
on the safe re-prefill path.
…; enable consume default-on

The consume-restored-KV path zero-padded kvflash_history_ for the restored prefix,
poisoning the drafter residency scorer under DFLASH_KVFLASH+draft+qk-policy.
Reconstruct it from the Phase-1 ledger scores so the drafter sees correct residency.
Validated under the production spec-decode path: needle retrieved + drafter accept
healthy at 64K/114K under consume. KVFLASH_RESTORE_CONSUME now defaults on
(env=0 force-disables).

Validation (35B-A3B-Q3_K_XL + dflash drafter + kvflash-policy=qk + q4_0 KV):
  ctx  | C0 needle | C1 needle | C0 accept | C1 accept | C0 t3_s | C1 t3_s | speedup
  64K  | RETRIEVED | RETRIEVED | 10.9%     | 10.9%     | 132.7   | 0.2     | 663x
  114K | RETRIEVED | RETRIEVED | 10.9%     | 10.9%     | 165.7   | 1.7     | 97x
…uard removal

core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.
…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)
@dusterbloom dusterbloom force-pushed the pr/kvflash-ledger-consume branch from 4670ae8 to 17e97d9 Compare June 24, 2026 20:24
restore-consume re-pooled QK chunks from live cache; evicted chunks (block_of=-1)
scored as missing -> ~77% of restored prefix dropped -> hallucination.
Reconstruct pooled-K from the deserialized ledger host bytes (pool_chunk_host)
and set kvflash_qk_pooled_upto_=restored_sealed so only true suffix chunks re-pool.
Restores QK warm-restore correctness (charbench code_complete 73.6->85.2% accept,
tool_call 29.2->53.1%).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1511">
P1: Skipping repool when `has(c)` is true can preserve stale pooled-K vectors across requests. That makes residency scoring use old chunk content and can evict/retain wrong chunks after warm restore.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

const int ct = kvflash_pager_.chunk_tokens();
const int sealed = committed / ct;
for (int c = kvflash_qk_pooled_upto_; c < sealed; c++) {
if (kvflash_qk_pool_.has(c)) continue; // keep restored/host-pooled keys

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Skipping repool when has(c) is true can preserve stale pooled-K vectors across requests. That makes residency scoring use old chunk content and can evict/retain wrong chunks after warm restore.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 1511:

<comment>Skipping repool when `has(c)` is true can preserve stale pooled-K vectors across requests. That makes residency scoring use old chunk content and can evict/retain wrong chunks after warm restore.</comment>

<file context>
@@ -1488,6 +1508,7 @@ void Qwen35Backend::kvflash_qk_pool_to(int committed) {
     const int ct = kvflash_pager_.chunk_tokens();
     const int sealed = committed / ct;
     for (int c = kvflash_qk_pooled_upto_; c < sealed; c++) {
+        if (kvflash_qk_pool_.has(c)) continue;  // keep restored/host-pooled keys
         const int blk = kvflash_pager_.block_of(c);
         if (blk < 0 || !kvflash_qk_pool_.pool_chunk(cache_.attn_k, blk, ct, c)) {
</file context>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant