feat(kvflash): snapshot ledger + QK-pool restore-seed + consume-restored-KV#446
feat(kvflash): snapshot ledger + QK-pool restore-seed + consume-restored-KV#446dusterbloom wants to merge 11 commits into
Conversation
There was a problem hiding this comment.
12 issues found across 24 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">
<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:199">
P2: Prefill metric extraction conflates absent prefill_s with 0.0, producing misleading speedup/output when parse_done fails.</violation>
<violation number="2" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P2: Gate result can falsely pass when consume=1 is internally non-deterministic because c1_self is not included in pass/fail conditions.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: Unused member introduces dead state and misleading API surface. Remove it until graph-cache wiring is implemented.</violation>
</file>
<file name="bench/bitplane_lsh_experiment.py">
<violation number="1" location="bench/bitplane_lsh_experiment.py:286">
P2: `k_extra == 0` incorrectly keeps all remaining tokens, invalidating the sink+recent mass-recall numbers at small k. Return an empty extra set when no extra budget is available.</violation>
<violation number="2" location="bench/bitplane_lsh_experiment.py:335">
P2: Script introduces an undeclared runtime dependency on SciPy. This makes the experiment fail in standard project environments unless SciPy is installed out-of-band.</violation>
<violation number="3" location="bench/bitplane_lsh_experiment.py:386">
P2: Default output directory is machine-specific and non-portable. Use a repo-relative default path.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:32">
P2: Incomplete list of existing vec_dot FA functions — document claims only f16, q4_0, and bf16 exist, but ggml-cuda fattn-common.cuh also has q4_1, q5_0, q5_1, and q8_0. Missing these 4 real types undermines the claim that the vec_dot set is small and makes the selective-instantiation recommendation less grounded.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:122">
P1: Wrong claim that vLLM has no sub-8-bit KV support (❌ in table). vLLM merged TurboQuant (PR #38479) on 2026-04-15 — 2+ months before this document's research date — with presets `tq-k3v4nc` (3-bit MSE keys) and `tq-t3nc` (3-bit keys + values). This invalidates the 'floor at 8-bit' narrative.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| |---|---|---|---| | ||
| | FP8 (E4M3/E5M2) KV | ✅ production (FA3 now runs attention *in* FP8) | ✅ production | ✅ production | | ||
| | INT8 KV | ❌ open RFC (#33480/#37319) | limited | ✅ | | ||
| | 4-bit / 3-bit / 2-bit KV | ❌ | ❌ | ❌ | |
There was a problem hiding this comment.
P1: Wrong claim that vLLM has no sub-8-bit KV support (❌ in table). vLLM merged TurboQuant (PR #38479) on 2026-04-15 — 2+ months before this document's research date — with presets tq-k3v4nc (3-bit MSE keys) and tq-t3nc (3-bit keys + values). This invalidates the 'floor at 8-bit' narrative.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 122:
<comment>Wrong claim that vLLM has no sub-8-bit KV support (❌ in table). vLLM merged TurboQuant (PR #38479) on 2026-04-15 — 2+ months before this document's research date — with presets `tq-k3v4nc` (3-bit MSE keys) and `tq-t3nc` (3-bit keys + values). This invalidates the 'floor at 8-bit' narrative.</comment>
<file context>
@@ -0,0 +1,179 @@
+|---|---|---|---|
+| FP8 (E4M3/E5M2) KV | ✅ production (FA3 now runs attention *in* FP8) | ✅ production | ✅ production |
+| INT8 KV | ❌ open RFC (#33480/#37319) | limited | ✅ |
+| 4-bit / 3-bit / 2-bit KV | ❌ | ❌ | ❌ |
+
+- Industry rationale: "quantizing 16-bit KV to 4-bit or 2-bit … quality issues still remain, making it challenging for practical deployment." (SqueezeBits vLLM-vs-TRT blog.)
</file context>
| parser.add_argument("--k", default="/tmp/phase0_dump/phase0_K.bin") | ||
| parser.add_argument("--q", default="/tmp/phase0_dump/phase0_Q.bin") | ||
| parser.add_argument("--out-dir", | ||
| default="/home/peppi/Dev/lucebox-hub/bench/qwen35moe_dflash/ctxsweep/") |
There was a problem hiding this comment.
P2: Default output directory is machine-specific and non-portable. Use a repo-relative default path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 386:
<comment>Default output directory is machine-specific and non-portable. Use a repo-relative default path.</comment>
<file context>
@@ -0,0 +1,392 @@
+ parser.add_argument("--k", default="/tmp/phase0_dump/phase0_K.bin")
+ parser.add_argument("--q", default="/tmp/phase0_dump/phase0_Q.bin")
+ parser.add_argument("--out-dir",
+ default="/home/peppi/Dev/lucebox-hub/bench/qwen35moe_dflash/ctxsweep/")
+ args = parser.parse_args()
+ run_experiment(args.k, args.q, args.out_dir)
</file context>
| break | ||
|
|
||
| # Spearman rank correlation | ||
| from scipy.stats import spearmanr |
There was a problem hiding this comment.
P2: Script introduces an undeclared runtime dependency on SciPy. This makes the experiment fail in standard project environments unless SciPy is installed out-of-band.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 335:
<comment>Script introduces an undeclared runtime dependency on SciPy. This makes the experiment fail in standard project environments unless SciPy is installed out-of-band.</comment>
<file context>
@@ -0,0 +1,392 @@
+ break
+
+ # Spearman rank correlation
+ from scipy.stats import spearmanr
+ rho_1bit, _ = spearmanr(s_true, s_1bit)
+ rho_2bit, _ = spearmanr(s_true, s_2bit)
</file context>
| if k_extra > 0 and len(rem_scores) > k_extra: | ||
| rem_top_idx = np.where(remaining)[0][ | ||
| np.argpartition(rem_scores, -k_extra)[-k_extra:] | ||
| ] | ||
| else: | ||
| rem_top_idx = np.where(remaining)[0] |
There was a problem hiding this comment.
P2: k_extra == 0 incorrectly keeps all remaining tokens, invalidating the sink+recent mass-recall numbers at small k. Return an empty extra set when no extra budget is available.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 286:
<comment>`k_extra == 0` incorrectly keeps all remaining tokens, invalidating the sink+recent mass-recall numbers at small k. Return an empty extra set when no extra budget is available.</comment>
<file context>
@@ -0,0 +1,392 @@
+ k_extra = max(0, k_total - n_always)
+ # Top k_extra from remaining by approx
+ rem_scores = scores[remaining]
+ if k_extra > 0 and len(rem_scores) > k_extra:
+ rem_top_idx = np.where(remaining)[0][
+ np.argpartition(rem_scores, -k_extra)[-k_extra:]
</file context>
| if k_extra > 0 and len(rem_scores) > k_extra: | |
| rem_top_idx = np.where(remaining)[0][ | |
| np.argpartition(rem_scores, -k_extra)[-k_extra:] | |
| ] | |
| else: | |
| rem_top_idx = np.where(remaining)[0] | |
| if k_extra <= 0: | |
| rem_top_idx = np.array([], dtype=np.int64) | |
| elif len(rem_scores) > k_extra: | |
| rem_top_idx = np.where(remaining)[0][ | |
| np.argpartition(rem_scores, -k_extra)[-k_extra:] | |
| ] | |
| else: | |
| rem_top_idx = np.where(remaining)[0] |
| Per-type vec_dot functions that exist today: | ||
| - `vec_dot_fattn_vec_KQ_f16` (FP16) | ||
| - `vec_dot_fattn_vec_KQ_q4_0` — "Dequantizes and dots using `dp4a`" | ||
| - `vec_dot_fattn_vec_KQ_bf16` |
There was a problem hiding this comment.
P2: Incomplete list of existing vec_dot FA functions — document claims only f16, q4_0, and bf16 exist, but ggml-cuda fattn-common.cuh also has q4_1, q5_0, q5_1, and q8_0. Missing these 4 real types undermines the claim that the vec_dot set is small and makes the selective-instantiation recommendation less grounded.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 32:
<comment>Incomplete list of existing vec_dot FA functions — document claims only f16, q4_0, and bf16 exist, but ggml-cuda fattn-common.cuh also has q4_1, q5_0, q5_1, and q8_0. Missing these 4 real types undermines the claim that the vec_dot set is small and makes the selective-instantiation recommendation less grounded.</comment>
<file context>
@@ -0,0 +1,179 @@
+Per-type vec_dot functions that exist today:
+- `vec_dot_fattn_vec_KQ_f16` (FP16)
+- `vec_dot_fattn_vec_KQ_q4_0` — "Dequantizes and dots using `dp4a`"
+- `vec_dot_fattn_vec_KQ_bf16`
+
+**Why quantized KV cannot use tensor cores:** "tensor cores require dense, regularly-formatted operands. Quantized formats store multiple values packed per element with scale/zero-point metadata. The MMA hardware cannot directly operate on these compressed representations — dequantization must occur first, eliminating the memory efficiency gains." (DeepWiki).
</file context>
|
|
||
| // Persistent pipelined state (initialized once, reused across requests) | ||
| std::unique_ptr<struct PipelinedDecodeState> pipe_state_; | ||
| std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_; |
There was a problem hiding this comment.
P3: Unused member introduces dead state and misleading API surface. Remove it until graph-cache wiring is implemented.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.h, line 111:
<comment>Unused member introduces dead state and misleading API surface. Remove it until graph-cache wiring is implemented.</comment>
<file context>
@@ -83,13 +96,20 @@ class Qwen35MoeBackend : public Qwen35Backend {
// Persistent pipelined state (initialized once, reused across requests)
std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
+ std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;
+ bool spec_microbench_done_ = false;
bool ensure_pipe_state(int kv_start);
</file context>
…max_ctx The MoE expert placement reserved KV for max_ctx (10 GiB @131072) even with --kvflash, forcing experts cold -> the pool was pure overhead. Reserve for the resident pool instead when the full reservation would force experts cold, so experts stay hot at high max_ctx (decouples max_ctx from the expert-placement cliff). A post-init gate disables KVFlash when it is redundant (full KV already fits all experts hot), keyed on all-hot-with-full-KV so it never disables a pool that is itself keeping experts hot. The rule is a shared pure helper (common/kvflash_placement.h) so future MoE backends inherit it. Unit test (5 cases, no GPU) + hardware-gated integration test (RTX 3090: 2203 cold -> 0 cold @max_ctx 131072, decode 43->66 tok/s).
Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged KV in logical chunk order; header-validated against layout) and a factored for_each_segment() helper. serde uses synchronous get/set and adapts to the pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning (pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the ensure_free_block + reselect protections; empty by default (byte-identical non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard reject, pinning, deadlock guard, reset.
…r KVFlash Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction automatic in alloc_span); the restore residual delta routes through the same chunked path. Pooled snapshot save/restore serializes the pager into the prefix snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn 128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385) and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data. GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool sizes; cross-turn disk restore round-trips losslessly.
…gment Three complexity cuts, no behavior change (GPU sink-recall gate + serde/ placement unit tests green): - merge restore residual's identical snap_pooled/else chunk loops into one (the else ct ternary already subsumes the pooled case) - extract chunked_prefill() shared by generate_impl kvf_paged + restore residual - inline single-caller for_each_segment template into serialize net -25 lines (54 ins / 79 del).
…restore works
At ≥128K with the KVFlash pool active, turn 1 never saved a prefix snapshot —
the pooled-prefill branch was stubbed to a diagnostic ("boundary snapshot
skipped: pooled prefill relocates chunks") and returned without saving. So turn
2 found nothing to restore (prefix_len=0), fell back to a full cold re-prefill
(0.8s→77.6s), decode regressed 80→20 tok/s, and turn 3 crashed. The all-hot
35B-A3B runs the dense Qwen35Backend path (moe_hybrid==nullptr), so this was the
live bug for the user's deep-context (>128K = 39% of real prompts) workload.
- add KvFlashPager::serialize(max_chunks) to capture only chunks [0, max_chunks)
— the chunk-aligned turn boundary, not the whole prompt.
- add Qwen35Backend::snapshot_save_pooled_at(slot, boundary): floor the requested
snap_pos to a chunk multiple, set cur_pos to that boundary, serialize the
partial pager blob, and save it (the restore/deserialize path already existed
and was correct — only the save was missing).
- replace the pooled-prefill skip stub at the chunk-aligned boundary with the
real save; mirror the same save on the qwen35moe hybrid path.
- unit tests: floor_to_chunk + serialize(max_chunks) partial round-trip
(bit-identical first k chunks).
131K 3-turn smoke: turn-2 restore=true prefix_len=34077 (97.5% hit), turn-3
restore=true, no crash, tool_call_valid=1.0, decode recovered 20→56-59 tok/s.
Known follow-up: warm-prefill at 131K is still ~44s (deserialize re-pages the
whole pool) — correctness/crash/decode are fixed; restoring only resident chunks
is the next optimization.
…or QK×PR372 composition The library foundation of the snapshot×ledger unification (plan in thoughts/), so the proven QK residency scorer composes with PR#372 across restore at ≥128K. - Phase 1 (kvflash_pager.h): per-chunk ledger in serialize/deserialize — was_resident + qk_score + KV dtype enum; magic bumped KVFLASH1 (old blobs cleanly miss); deserialize re-pages only resident chunks; dtype-guard closes the latent equal-rowsize swap trap. Unit-tested (ledger round-trip). - Phase 2 (kvflash_qk.h): rebuild/seed the QK pool from the restored ledger so the scorer is warm on turn N+1 instead of scoring every restored chunk as missing(-2.0). Unit-tested (8 new checks, restored scores != missing). - Research/evidence: phase0_bitplane_lsh (the SimHash-on-quant-bits kill-test — surprise: MSB ρ=0.871 vs true QK, refutes "≈random", but modest given diffuse attention; sign bit carries the ranking); tbq4/tq3 fast-FA prior art. Phase 3 (consume restored KV instead of re-prefill — VALIDATED: 36.5x warm prefill, AR greedy bit-identity PASS, binary 0b70418a) is preserved as a patch (/tmp/b_phase23_plus_blockerA_*.patch); its qwen35_backend.cpp integration is interleaved with an uncommitted CUDA-graph blocker-A draft and will land after a clean un-interleave.
…efill (opt-in) Pooled restore consumes the deserialized KV for the chunk-aligned prefix [0, snap_pos) and prefills only the suffix [snap_pos, prompt_len), behind KVFLASH_RESTORE_CONSUME (default 0 = legacy re-prefill). Validated: turn-2 prefill 36.9x faster (36.9s->1.0s) at ~35K tokens, with greedy AR output token-for-token IDENTICAL to the re-prefill path. Completes the snapshot x ledger x QK-pool composition (Phases 1-3). KNOWN CEILING: above ~35K tokens the AR output DIVERGES from full re-prefill (reused pooled KV differs from recompute once the 8192-pool evicts at scale). Do NOT enable default-on for deep context until that divergence is root-caused (acceptable KV-reuse near-tie flip vs real seam bug). Default-0 keeps production on the safe re-prefill path.
…; enable consume default-on The consume-restored-KV path zero-padded kvflash_history_ for the restored prefix, poisoning the drafter residency scorer under DFLASH_KVFLASH+draft+qk-policy. Reconstruct it from the Phase-1 ledger scores so the drafter sees correct residency. Validated under the production spec-decode path: needle retrieved + drafter accept healthy at 64K/114K under consume. KVFLASH_RESTORE_CONSUME now defaults on (env=0 force-disables). Validation (35B-A3B-Q3_K_XL + dflash drafter + kvflash-policy=qk + q4_0 KV): ctx | C0 needle | C1 needle | C0 accept | C1 accept | C0 t3_s | C1 t3_s | speedup 64K | RETRIEVED | RETRIEVED | 10.9% | 10.9% | 132.7 | 0.2 | 663x 114K | RETRIEVED | RETRIEVED | 10.9% | 10.9% | 165.7 | 1.7 | 97x
…uard removal core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.
…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)
4670ae8 to
17e97d9
Compare
restore-consume re-pooled QK chunks from live cache; evicted chunks (block_of=-1) scored as missing -> ~77% of restored prefix dropped -> hallucination. Reconstruct pooled-K from the deserialized ledger host bytes (pool_chunk_host) and set kvflash_qk_pooled_upto_=restored_sealed so only true suffix chunks re-pool. Restores QK warm-restore correctness (charbench code_complete 73.6->85.2% accept, tool_call 29.2->53.1%).
There was a problem hiding this comment.
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1511">
P1: Skipping repool when `has(c)` is true can preserve stale pooled-K vectors across requests. That makes residency scoring use old chunk content and can evict/retain wrong chunks after warm restore.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| const int ct = kvflash_pager_.chunk_tokens(); | ||
| const int sealed = committed / ct; | ||
| for (int c = kvflash_qk_pooled_upto_; c < sealed; c++) { | ||
| if (kvflash_qk_pool_.has(c)) continue; // keep restored/host-pooled keys |
There was a problem hiding this comment.
P1: Skipping repool when has(c) is true can preserve stale pooled-K vectors across requests. That makes residency scoring use old chunk content and can evict/retain wrong chunks after warm restore.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 1511:
<comment>Skipping repool when `has(c)` is true can preserve stale pooled-K vectors across requests. That makes residency scoring use old chunk content and can evict/retain wrong chunks after warm restore.</comment>
<file context>
@@ -1488,6 +1508,7 @@ void Qwen35Backend::kvflash_qk_pool_to(int committed) {
const int ct = kvflash_pager_.chunk_tokens();
const int sealed = committed / ct;
for (int c = kvflash_qk_pooled_upto_; c < sealed; c++) {
+ if (kvflash_qk_pool_.has(c)) continue; // keep restored/host-pooled keys
const int blk = kvflash_pager_.block_of(c);
if (blk < 0 || !kvflash_qk_pool_.pool_chunk(cache_.attn_k, blk, ct, c)) {
</file context>
Wave-1 KVFlash deep-context — the warm-prefill payoff.
New commits: snapshot ledger + QK-pool restore-seed (f457afb, lib+tests); consume-restored-KV on pooled restore instead of re-prefill (cfbf8ed, env KVFLASH_RESTORE_CONSUME, default-on); drafter-history reconstruction from ledger (14b295e).
Stacks on: #445 (cache-persistence), then #430 core, #429 pager, #428 reservation. Lower commits in this diff are those predecessors; review/merge in order — this rebases to drop them as they land.
Reviewer caveat (correctness): KVFLASH_RESTORE_CONSUME=1 output diverges from the no-consume AR path between ~35K and ~60K tokens. Root cause: chunk-to-block placement permutation changes the FlashAttention key-reduction order, FP non-associativity perturbs the logits, and a near-tie greedy argmax flips. See bench/abc_cache_harness/phase3_divergence_rootcause.md. It is an accepted reuse-vs-recompute flip, NOT a correctness bug — but output is NOT bit-identical at long ctx and there is NO agentic gate yet.
The ~0.1s warm-turn speedup is synthetic NIAH single-needle, not a published result. A real-CLI multi-turn output-equivalence capture is the open follow-up before defaulting this on for any widely-read claim.