
Spec: Streaming Prover #1629

Draft: sashafrolov wants to merge 55 commits into a16z:main from sashafrolov:feat/streaming-jolt


@sashafrolov

Contributor

Draft spec for the streaming Jolt prover. See specs/jolt-streaming-prover.md.

@github-actions (bot) added the spec (Tracking issue for a feature spec) and implementation (PR contains implementation of a spec) labels on Jun 18, 2026
@sashafrolov

Contributor Author

@claude analyze

@0xAndoroid added the claude-spec-review-request label (Triggers Claude spec analysis) on Jun 18, 2026
@github-actions

Contributor

Claude spec review session started: https://claude.ai/code/session_0113TJTBYW1ukcB76q2vvJbB

Collaborator

Spec Analysis: Streaming Prover (spec @ 5a8a631)

| Dimension | Score | Gap |
|---|---|---|
| Goal | 0.92 | Clear: crisp objective, Intent and Design agree |
| Constraints | 0.82 | Per-stage "strictly decreases" RAM gate vs. the out-of-scope RAM/IO term needs reconciling |
| Success Criteria | 0.72 | Byte-identity oracle (jolt-core vs. modular-streaming-off) underspecified |
| Spec–Codebase Consistency | 0.95 | All file/type/invariant claims verified accurate |
| Ambiguity | 16% | |

Status: Questions remain — 1 blocking item to resolve before implementation.

I verified the spec's codebase claims against the PR branch (65ff5a8) and jolt-eval on main. Everything checks out: crates/jolt-prover/src/stages/stage0 through stage8 (9 stages); cpu/schedule/mod.rs defining StreamingSchedule/HalfSplitSchedule/LinearOnlySchedule; cpu/sumcheck/{kernels,mod.rs} with no schedule consumer; StreamingSumcheckWindow/LinearSumcheckStage traits in jolt-core/src/subprotocols/streaming_sumcheck.rs; streaming.rs in jolt-dory/jolt-witness; LazyTraceIterator in tracer/src/lib.rs; jolt-prover/Cargo.toml's zk/field-inline forwarding pattern; and the referenced invariants (soundness, split_eq_bind, field_mul_scalar) and objectives (prover_time_sha2_chain_100, prover_time_fibonacci_100). No factual contradictions. The modular kernel layer indeed has no SumcheckInstanceProver/SumcheckInstanceParams traits, as claimed.

Questions:

1. [Blocking — Success Criteria] The streaming_proof_byte_identical invariant asserts whole-proof byte equality between the modular streaming prover and jolt-core. Two things make this oracle choice ambiguous as written:

  • I found no existing whole-proof byte-identity test between the modular prover and jolt-core. The established parity mechanism in jolt-prover-harness compares at component granularity, via ComparisonTarget::{CoreCommitments, CoreStageOutput, CoreOpeningClaims, CoreProofShape} (src/parity.rs, plus the frontier_stage0..8 tests), not serialized proof bytes. So it's unclear whether modular streaming-off is already byte-identical to jolt-core today.
  • Using jolt-core as the oracle couples two properties: modular-vs-core parity and streaming-on-vs-off correctness. The most direct oracle that isolates the streaming change is the modular prover with streaming off (same crate stack, same serialization), which the streaming flag's "computation unchanged" criterion already pins down.

Candidate answer — please confirm one: (a) modular streaming-off is already byte-identical to jolt-core and a whole-proof byte test passes today, so jolt-core is a sound oracle; or (b) the oracle should be modular streaming-off for the byte-identity assertion (isolating streaming), with jolt-core parity remaining the existing component-level check. If (a), it would help to name the existing test; if neither holds yet, the criterion may need to drop to component-parity (commitments/claims/shape) rather than whole-proof bytes. This determines exactly what the headline test asserts, so a wrong assumption means the new invariant fails for reasons unrelated to streaming.

Suggestions (non-blocking):

  • Per-stage RAM gate vs. out-of-scope memory term. The acceptance criterion and the Execution "promotion rule" hard-gate each converted stage on within-stage peak RAM that strictly decreases, but Non-Goals exclude "the prover-memory term that is linear in the zkVM's addressable RAM or I/O." A stage whose within-stage peak is dominated by that out-of-scope term may not strictly decrease from streaming the trace alone. Default assumption if unaddressed: the gate is measured against the trace-derived within-stage peak, and for a stage where the non-trace term dominates, "does not increase" counts as passing. Consider stating this explicitly so a legitimately-converted stage isn't blocked (or a regression masked) by the measurement boundary.
  • The modular stack is branch-local. jolt-prover, jolt-witness, jolt-backends, and jolt-prover-harness exist only on feat/streaming-jolt (+~208k lines), not on main. This is consistent with the spec, but worth noting that /implement-spec must run on this branch — a fresh main checkout has none of the target crates. Default: implementation proceeds on feat/streaming-jolt.
  • Harness already partly exists. Execution step 1 says "Stand up the per-stage gate harness (the jolt-prover-harness)", but a substantial harness is already present (parity.rs, core_fixture.rs, frontier_stage0..8 tests, metrics.rs). Default: extend the existing harness (add the byte-identity/shape comparison and per-stage peak-RAM logging on top) rather than build a new one. A one-line nod to the existing infra would prevent duplication.
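The default assumption in the first suggestion can be written down as a mechanical gate. This is a sketch of one reading of it, not the spec's rule: `StageMetrics`, `GateResult`, and `stage_gate` are hypothetical names, and "does not increase" is interpreted as the stage's total peak (trace-derived plus the out-of-scope RAM/IO term) not growing when the non-trace term dominates the baseline. The 2x time guard comes from the spec's per-stage regression bound.

```rust
/// Hypothetical per-stage measurements collected by the harness.
#[derive(Clone, Copy, Debug)]
pub struct StageMetrics {
    /// Within-stage peak bytes held by trace-derived data structures.
    pub trace_peak_bytes: u64,
    /// Within-stage peak bytes from the out-of-scope RAM/IO term.
    pub non_trace_peak_bytes: u64,
    /// Wall-clock time of the stage in milliseconds.
    pub time_ms: f64,
}

#[derive(Debug, PartialEq)]
pub enum GateResult {
    Pass,
    RamNotReduced,
    TooSlow,
}

/// One reading of the promotion gate (an assumption, not the spec's text):
/// - normally the trace-derived peak must strictly decrease;
/// - if the non-trace term dominates the baseline, the total peak only has
///   to not increase;
/// - streaming time must stay strictly under `max_slowdown` x baseline.
pub fn stage_gate(base: StageMetrics, streaming: StageMetrics, max_slowdown: f64) -> GateResult {
    let total = |m: &StageMetrics| m.trace_peak_bytes + m.non_trace_peak_bytes;
    let non_trace_dominates = base.non_trace_peak_bytes > base.trace_peak_bytes;
    let ram_ok = if non_trace_dominates {
        total(&streaming) <= total(&base)
    } else {
        streaming.trace_peak_bytes < base.trace_peak_bytes
    };
    if !ram_ok {
        return GateResult::RamNotReduced;
    }
    if streaming.time_ms >= max_slowdown * base.time_ms {
        return GateResult::TooSlow;
    }
    GateResult::Pass
}
```

Splitting the RAM figure into two fields is what lets the gate avoid both failure modes the suggestion mentions: blocking a legitimately converted stage, or masking a regression behind the IO term.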

After updating the spec, re-add the claude-spec-review-request label to re-run this analysis.


Generated by Claude Code

@moodlezoup removed the claude-spec-review-request label (Triggers Claude spec analysis) on Jun 18, 2026 (with Claude)
@sashafrolov

Contributor Author

I addressed the comments (my local Claude didn't flag these issues), but I can't modify the labels on the PR myself.

@markosg04 left a comment

Collaborator

Good work! Most of my comments are just me trying to provide useful context and, hopefully, to anticipate any potential footguns with your approach.

Comment thread on specs/jolt-streaming-prover.md (outdated)
Comment on lines +18 to +22
Space](https://eprint.iacr.org/2025/611): switch every access to the trace
vector to a lazily generated, streaming view so peak prover RAM scales with
`√T` instead of `T`, bounding memory to a few GB for arbitrarily long
executions and making Jolt usable on laptops and other memory-constrained
machines. The feature is gated behind a new compile-time `streaming` Cargo

Collaborator

I suppose we want the data structure for a particular sum-check / prover component to be either sqrt(T)-sized or a streaming view of the data.

3. **Run a whole-prover optimization pass after all 9 stages are converted**,
since per-stage 2× guards can compound — optimize until end-to-end prover
time is within ~2× of the non-streaming baseline.

Collaborator

You may want to consider adding additional mechanical checks (https://github.com/semgrep/semgrep) to make sure agents don't accidentally modify the verifier or other things they shouldn't touch during a slice. Codex/Claude are heavily trained to follow a loop of implement, check, find the next step, and repeat, so you might also want to consider per-step commit instructions.

Comment thread on specs/jolt-streaming-prover.md (outdated)
Comment on lines +39 to +42
Route every prover access to the execution trace through a lazily generated,
parallel-consumable streaming view — gated behind a compile-time `streaming`
Cargo feature — so that peak prover RAM scales with `√T` rather than `T` while
the prover emits byte-identical proofs.

Collaborator

(Just pointing this out again; you can probably update the whole spec in one shot.) You may want to adjust this to say that the prover algorithm either allocates sqrt(T) space or uses a streaming view of a linear-space data structure.

Comment on lines +50 to +57
- Two streaming-aware core algorithms underpin the feature: (1) a **streaming
Dory commitment** that computes the commitment's vector-matrix products over
trace chunks without materializing the polynomial. The streaming
commitment surface exists today
([crates/jolt-dory/src/streaming.rs](crates/jolt-dory/src/streaming.rs)), but
the opening path is **not yet fully streaming** — part of it still
materializes a witness that should be consumed chunk-wise, which this spec
must eliminate;

Collaborator

Perhaps the PR with the 'modular' jolt-prover stack (which your work should build on) doesn't have the sqrt-space optimization for the Dory commitment and vector-matrix product, but jolt-core does (also see #1632).

While I'm at it, let me clarify: we took jolt-core and 'stripped' it down to jolt-prover-legacy so that we could finish the cutover of the modular stack and unify on the single jolt-verifier. The other jolt-prover line of work, coming soon, introduces the decoupled model of the witness, prover orchestration, and backends/compute primitives.
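To make the sqrt-space point concrete, here is a toy sketch of the memory pattern behind a streaming vector-matrix product, the operation the spec excerpt says the streaming Dory commitment computes over trace chunks. It is not jolt-dory's API: `P`, `witness_coeff`, `mul_add`, `streaming_vec_mat`, and `materialized_vec_mat` are hypothetical, scalars are plain integers modulo a toy prime rather than real field or group elements, and the trace is assumed to be laid out as a `rows x cols` matrix whose rows can be regenerated on demand.

```rust
/// Toy prime modulus standing in for the real scalar field (an assumption).
const P: u64 = 1_000_000_007;

/// Hypothetical stand-in for the witness coefficient at trace index `i`.
/// In a real prover this would be derived from a regenerated trace row.
fn witness_coeff(i: u64) -> u64 {
    (i.wrapping_mul(2_654_435_761) ^ (i >> 3)) % P
}

/// Computes (acc + a * b) mod P with a u128 intermediate to avoid overflow.
fn mul_add(acc: u64, a: u64, b: u64) -> u64 {
    ((acc as u128 + a as u128 * b as u128) % P as u128) as u64
}

/// Computes `left^T * M` for the `left.len() x cols` matrix whose row `r`
/// holds coefficients `r*cols .. (r+1)*cols`. Each row is generated, folded
/// into `acc`, and dropped, so live memory is O(cols): about sqrt(T) words
/// for a square layout, never O(T).
fn streaming_vec_mat(left: &[u64], cols: usize) -> Vec<u64> {
    let mut acc = vec![0u64; cols];
    for (r, &l) in left.iter().enumerate() {
        // Only this row exists in memory during the iteration.
        let row: Vec<u64> = (0..cols)
            .map(|c| witness_coeff((r * cols + c) as u64))
            .collect();
        for (a, &m) in acc.iter_mut().zip(&row) {
            *a = mul_add(*a, l, m);
        }
    }
    acc
}

/// Reference version that materializes all T = rows * cols coefficients.
fn materialized_vec_mat(left: &[u64], cols: usize) -> Vec<u64> {
    let t = left.len() * cols;
    let all: Vec<u64> = (0..t as u64).map(witness_coeff).collect();
    (0..cols)
        .map(|c| {
            left.iter()
                .enumerate()
                .fold(0u64, |s, (r, &l)| mul_add(s, l, all[r * cols + c]))
        })
        .collect()
}
```

Both functions return the same vector; the streaming one simply trades a regeneration of each row for never holding all T coefficients at once.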

Comment on lines +63 to +66
- A `streaming` Cargo feature on `jolt-prover` that forwards to the backend
capability crates (`jolt-backends`, `jolt-witness`, `jolt-dory`), selecting
the streaming code path at compile time via `#[cfg(feature = "streaming")]`
exactly as `zk`/`field-inline` forward today

Collaborator

We may want to consider a nice, generic compile-time cfg for other backends too, since this streaming backend is the first 'alternative' to the canonical CPU backend that we are introducing (GPU backends are coming soon). This is not too important right now, just something to think about.
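One way to get a generic compile-time switch is to put each backend behind a trait and select a single type alias by `cfg`, so prover code stays generic and a future GPU feature only adds another arm. This is a hypothetical sketch: `ProverBackend`, `CpuBackend`, `StreamingCpuBackend`, `SelectedBackend`, and `describe` are illustrative names, not jolt-backends items; only the `streaming` feature name comes from the spec.

```rust
/// Hypothetical backend abstraction; the real jolt-backends traits may differ.
pub trait ProverBackend {
    const NAME: &'static str;
    /// Whether the backend keeps the full trace resident in memory.
    const MATERIALIZES_TRACE: bool;
}

pub struct CpuBackend;
pub struct StreamingCpuBackend;

impl ProverBackend for CpuBackend {
    const NAME: &'static str = "cpu";
    const MATERIALIZES_TRACE: bool = true;
}

impl ProverBackend for StreamingCpuBackend {
    const NAME: &'static str = "cpu-streaming";
    const MATERIALIZES_TRACE: bool = false;
}

/// The only place the feature flag is consulted. A future `gpu` feature
/// would add another `cfg` arm here instead of spreading cfgs through stages.
#[cfg(feature = "streaming")]
pub type SelectedBackend = StreamingCpuBackend;
#[cfg(not(feature = "streaming"))]
pub type SelectedBackend = CpuBackend;

/// Prover code is generic over the backend and never names a cfg itself.
pub fn describe<B: ProverBackend>() -> String {
    format!("{} (materializes trace: {})", B::NAME, B::MATERIALIZES_TRACE)
}
```

Call sites would use `describe::<SelectedBackend>()` (or, in a real prover, run stages generically over `SelectedBackend`), so adding a backend touches the alias, not the stages.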

Comment on lines +242 to +247
- **`tracer`** — owns the lazy-trace mechanism (`LazyTraceIterator`,
[tracer/src/lib.rs](tracer/src/lib.rs)). The streaming trace view is built
here; trace ownership stays in `tracer`. The first CPU run is serial, but
subsequent recompute passes run in parallel (rayon), so chunk access must be
parallelizable and chunked streaming must not serialize the prover's hot
loops.

Collaborator

So tracer should hold the actual program-emulation implementation (hence the lazy iterator API), while jolt-witness should provide a generic interface for the streaming witness, which we can then implement for tracer. Let's try to maintain this decoupled, generic style so that it becomes simple to swap out the tracer later without re-implementing streaming-backend code.
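A minimal sketch of that decoupling, assuming `jolt-witness` owns a small trait and `tracer` (or any replacement emulator) implements it. `StreamingWitness`, `ToyTracer`, and `parallel_sum` are hypothetical names. The trait exposes on-demand, range-based chunk regeneration through `&self` and requires `Sync`, which is what lets later passes consume chunks in parallel. The sketch spawns one scoped std thread per chunk only to stay dependency-free; rayon's work-stealing pool would be the natural choice in practice.

```rust
use std::ops::Range;

/// Hypothetical streaming-witness interface owned by `jolt-witness`.
/// The prover sees only this trait, never the concrete emulator.
pub trait StreamingWitness: Sync {
    type Row: Send;
    /// Total number of trace rows T.
    fn num_rows(&self) -> usize;
    /// Regenerates the rows in `range` on demand. Takes `&self` and the trait
    /// requires `Sync`, so disjoint chunks can be produced concurrently.
    fn chunk(&self, range: Range<usize>) -> Vec<Self::Row>;
}

/// Toy provider whose row `i` is a pure function of `i`. A real tracer would
/// resume emulation from a saved state instead (an assumption).
pub struct ToyTracer {
    pub steps: usize,
}

impl StreamingWitness for ToyTracer {
    type Row = u64;
    fn num_rows(&self) -> usize {
        self.steps
    }
    fn chunk(&self, range: Range<usize>) -> Vec<u64> {
        range.map(|i| (i as u64).wrapping_mul(31).wrapping_add(7)).collect()
    }
}

/// Generic prover-side consumer: folds rows chunk by chunk across threads,
/// so each thread holds at most one chunk in memory at a time.
pub fn parallel_sum<W: StreamingWitness<Row = u64>>(w: &W, chunk_size: usize) -> u64 {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let n = w.num_rows();
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..n)
            .step_by(chunk_size)
            .map(|start| {
                let end = (start + chunk_size).min(n);
                s.spawn(move || {
                    w.chunk(start..end)
                        .into_iter()
                        .fold(0u64, |acc, x| acc.wrapping_add(x))
                })
            })
            .collect();
        handles
            .into_iter()
            .fold(0u64, |acc, h| acc.wrapping_add(h.join().unwrap()))
    })
}
```

Swapping the tracer then means writing a new `impl StreamingWitness`, while `parallel_sum` and any other backend code stay untouched.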

and witness providers. The prover↔witness interface must offer a streaming
oracle, not a fully materialized one, and must allow parallel access
to its witness chunks.
- **`jolt-backends` (`cpu`)** — the CPU streaming compute and the bulk of the

Collaborator

  • Yeah, in general it's OK to separate out most of the streaming primitives into a completely different CPU backend (say, a streaming-backend) if that makes it easier to organize things.

polynomial types. The commitment surface is in place; the remaining work is
the opening path, which still materializes a witness that should instead be
consumed chunk-wise.
- **`jolt-sumcheck`** — the shared sumcheck protocol/verifier-contract crate

Collaborator

Again, you may want to use semgrep or other skills/checks to make sure agents don't accidentally touch APIs that in theory don't need to be modified here, such as jolt-sumcheck.
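Semgrep matches code patterns; a coarser, complementary guard that an agent loop could run before each commit is a path check over the changed files. This is a hypothetical sketch: `protected_violations` is an illustrative name, its input would come from something like `git diff --name-only`, and the `crates/jolt-sumcheck/` and `crates/jolt-verifier/` prefixes are assumed paths, not confirmed repository layout.

```rust
/// Hypothetical path-based guard: returns every changed path that falls
/// under a protected prefix. An empty result means the slice may proceed.
fn protected_violations<'a>(changed: &[&'a str], protected_prefixes: &[&str]) -> Vec<&'a str> {
    changed
        .iter()
        .copied()
        .filter(|path| protected_prefixes.iter().any(|p| path.starts_with(*p)))
        .collect()
}
```

A non-empty result would fail the step before the agent commits, which keeps the implement/check loop from quietly widening its scope.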

Comment thread on specs/jolt-streaming-prover.md (outdated)
Comment on lines +342 to +344
3. **Run a whole-prover optimization pass after all 9 stages are converted**,
since per-stage 2× guards can compound — optimize until end-to-end prover
time is within ~2× of the non-streaming baseline.

Collaborator

I would probably recommend adding a more detailed set of instructions, or some skills, for specifically how to carry out an 'optimization step': exactly how it should be instrumented and over what data (say, sha2-chain at 2^16 through 2^20 or so). Just make sure the 'acceptance' rules are strict, to hopefully avoid some reward hacking.

Collaborator

And you might want to consider using some of these skills for general development:
https://github.com/multica-ai/andrej-karpathy-skills

I've been using them lately and think they are helpful.

Comment thread on specs/jolt-streaming-prover.md (outdated)
Comment on lines +339 to +340
within-stage peak RAM strictly decreases; confirm per-stage time regression
stays under 2× (an

Collaborator

You can use tools like allocative (https://crates.io/crates/allocative), or roll your own solution with something like size_of, to mechanically check the prover's data structures, and/or standard instrumentation tools to measure RSS. Checking the actual data structures that involve T may be better, given unpredictable allocator behavior, though you may need to play around with it and see.
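A sketch of the roll-your-own `size_of` route, assuming a hypothetical `HeapBytes` trait that each prover data structure implements by hand (`StreamingRoundState` and `fits_sqrt_budget` are illustrative too). It counts reserved `Vec` capacity rather than process RSS, so the check does not depend on allocator behavior; the constant `c` in the `c * sqrt(T)` budget is a tuning assumption.

```rust
use std::mem::size_of;

/// Hypothetical hand-rolled heap accounting for prover data structures.
pub trait HeapBytes {
    fn heap_bytes(&self) -> usize;
}

impl<T> HeapBytes for Vec<T> {
    // Counts reserved capacity (what is actually allocated), not just len.
    // Assumes T owns no heap data of its own.
    fn heap_bytes(&self) -> usize {
        self.capacity() * size_of::<T>()
    }
}

/// Illustrative per-round state of a streaming sumcheck.
pub struct StreamingRoundState {
    pub eq_table: Vec<u64>,     // expected to be ~sqrt(T) entries
    pub chunk_buffer: Vec<u64>, // one regenerated trace chunk
}

impl HeapBytes for StreamingRoundState {
    fn heap_bytes(&self) -> usize {
        self.eq_table.heap_bytes() + self.chunk_buffer.heap_bytes()
    }
}

/// Mechanical check: live structures must fit in `c * ceil(sqrt(T))`
/// elements of `elem_size` bytes each.
pub fn fits_sqrt_budget(state: &impl HeapBytes, t: usize, elem_size: usize, c: usize) -> bool {
    let sqrt_t = (t as f64).sqrt().ceil() as usize;
    state.heap_bytes() <= c * sqrt_t * elem_size
}
```

A test that builds the state for a given T and asserts `fits_sqrt_budget` catches any structure that quietly grows with T, which RSS sampling can miss or misattribute.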

@quangvdao

Copy link
Copy Markdown
Contributor

This spec seems very ambitious, and there are a lot of under-specified details. From my past thinking about streaming Jolt, here are the blockers to getting the performance overhead down to something reasonable (I still think <2x is way too optimistic):

  • Lazy trace generation means you'll regenerate the trace many times across all stages (say 15-40, depending on how many streaming passes you take over each sum-check). The current tracer is just an interpreter and is not very optimized, so you will definitely feel the pain. You should consider writing a better tracer that compiles the ELF to a native binary (ARM or x86), executes it at native speed, and sets up checkpoints from which the trace can be re-expanded quickly.
  • Spartan is one sum-check that is already implemented in streaming form today. If you benchmark it, you'll see the slowdown exceeds 2x as soon as you make at least 4 streams over the data. There is some perf left on the table in the streaming kernels, but asymptotically you are adding an O(log log n) factor, so expecting at most a 2x slowdown may be unrealistic.
  • The sparse sum-checks are the hardest ones to stream. You will need to study the current implementation strategies for those sum-checks, then generalize them to the streaming setting. I remember encountering several difficult points that I was not able to resolve cleanly (e.g., with minimal perf loss).
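The checkpoint idea in the first point can be illustrated with a toy deterministic state machine standing in for the emulator (`step`, `record_checkpoints`, `reexpand`, and `parallel_reexpand_all` are hypothetical, and `step` is not RISC-V execution). The first serial run keeps one state snapshot every `k` steps; later passes re-expand each chunk independently, and therefore in parallel, from its snapshot. This shows only the checkpoint and re-expansion structure, not native-code execution speed, which is the other half of the suggestion.

```rust
/// Toy deterministic "CPU": each state is a pure function of the previous
/// one (an LCG step standing in for emulation; an assumption).
fn step(state: u64) -> u64 {
    state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407)
}

/// First (serial) run: execute `n` steps but keep only the state before
/// steps 0, k, 2k, ..., i.e. O(n / k) memory instead of O(n) rows.
fn record_checkpoints(init: u64, n: usize, k: usize) -> Vec<u64> {
    let mut cps = Vec::with_capacity(n.div_ceil(k));
    let mut s = init;
    for i in 0..n {
        if i % k == 0 {
            cps.push(s);
        }
        s = step(s);
    }
    cps
}

/// Re-expands chunk `c` (rows c*k .. min((c+1)*k, n)) from its checkpoint.
fn reexpand(cps: &[u64], c: usize, n: usize, k: usize) -> Vec<u64> {
    let len = k.min(n - c * k);
    let mut s = cps[c];
    let mut rows = Vec::with_capacity(len);
    for _ in 0..len {
        rows.push(s);
        s = step(s);
    }
    rows
}

/// Later passes re-expand all chunks concurrently (rayon in practice;
/// scoped std threads here to stay dependency-free).
fn parallel_reexpand_all(cps: &[u64], n: usize, k: usize) -> Vec<Vec<u64>> {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..cps.len())
            .map(|c| s.spawn(move || reexpand(cps, c, n, k)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Choosing `k` near sqrt(T) keeps both the checkpoint list and each re-expanded chunk at O(sqrt(T)) entries.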

@sashafrolov

Contributor Author

I updated the spec in response to all of your feedback. I have slightly downscoped this PR to an initial rough implementation using current algorithms, leaving the optimizations to a few subsequent rounds. In this implementation, every sumcheck will restream the witness either logarithmically or log-log many times (depending on whether windowed sumcheck applies). I have ideas for subsequent rounds on cutting down the number of witness regenerations and on using windowed sumcheck in better ways.
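For a rough sense of scale, the pass counts in the comment above can be tabulated. This is a back-of-the-envelope model that takes "logarithmically or log-log many times" literally and ignores constants; `passes_log`, `passes_log_log`, and `total_regenerations` are hypothetical names. With T = 2^20, a sumcheck that restreams once per round regenerates the witness about 20 times, versus about ceil(log2 20) = 5 times with a windowed schedule.

```rust
/// Model (an assumption): a sumcheck over T = 2^log_t rows that restreams
/// once per round regenerates the witness ~log2(T) times.
fn passes_log(log_t: u32) -> u32 {
    log_t
}

/// Model (an assumption): a windowed schedule needs ~ceil(log2(log2 T))
/// passes, clamped to at least one pass.
fn passes_log_log(log_t: u32) -> u32 {
    // Bit length of (log_t - 1) equals ceil(log2(log_t)) for log_t >= 2.
    (u32::BITS - log_t.saturating_sub(1).leading_zeros()).max(1)
}

/// Total witness regenerations across `num_sumchecks` sumchecks.
fn total_regenerations(num_sumchecks: u32, log_t: u32, windowed: bool) -> u32 {
    let per_sumcheck = if windowed {
        passes_log_log(log_t)
    } else {
        passes_log(log_t)
    };
    num_sumchecks * per_sumcheck
}
```

Because regenerations multiply across sumchecks, the per-sumcheck gap between the two schedules is what dominates the end-to-end tracer cost discussed earlier in the thread.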


Labels: implementation (PR contains implementation of a spec), spec (Tracking issue for a feature spec)


5 participants