Consensus: Fail closed on Spark batch verification#1863
Conversation
Summary by CodeRabbit
Walkthrough
ChangesSpark Batch Proof Verification Hardening
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
e5cc472 to
629f1d6
Compare
629f1d6 to
a4d9f1a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/batchproof_container.cpp (1)
28-35: 💤 Low valueClarify the conditional logic in
verify()with a comment.The current logic returns
true(success) whenfCollectProofsis true, which skips batch verification. This appears intentional—when proofs are still being collected, there's nothing finalized to verify yet—but the conditional is non-obvious. A brief comment would help future readers understand this is expected behavior rather than a bug.📝 Suggested comment
bool BatchProofContainer::verify() { if (fCollectProofs) { + // Still in collection mode; nothing finalized to verify yet. fCollectProofs = false; return true; } return batch_spark(); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/batchproof_container.cpp` around lines 28 - 35, The verify() method in BatchProofContainer has a non-obvious conditional that returns true when fCollectProofs is true, which skips batch verification. Add a clarifying comment above or within the if (fCollectProofs) block explaining that this early return is intentional because when proofs are still being collected, there is nothing finalized to verify yet, so the function returns success without performing the actual batch verification that happens in the batch_spark() call.src/validation.cpp (2)
2237-2246: ⚡ Quick winDocument the aborting verification boundary.
VerifyPendingSparkBatchis now a cross-layer helper with non-obvious side effects: it verifies pending proofs, aborts the node, starts shutdown, and mutatesstateon failure. Please add a Doxygen contract that captures the guarded boundary and caller preconditions.Suggested documentation
+/** + * Verify any deferred Spark proof batch before crossing a durable validation boundary. + * + * `@param` state Validation state populated when verification fails. + * `@param` reason Human-readable operation being guarded, used in diagnostics. + * `@return` true if no pending batch exists or verification succeeds; false after aborting the node. + * `@pre` Call before persisting or publishing state that depends on the pending Spark batch. + */ bool VerifyPendingSparkBatch(CValidationState& state, const std::string& reason)As per coding guidelines, “Use Doxygen-compatible comments with
@param,@return, and@pretags for function documentation.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/validation.cpp` around lines 2237 - 2246, Add Doxygen documentation to the VerifyPendingSparkBatch function that documents its contract and side effects. Include tags for `@param` describing the CValidationState reference and reason string parameter, `@return` describing the boolean return value indicating success or failure, and `@pre` describing caller preconditions. The documentation should clearly capture that this function verifies pending Spark proofs, may abort the node and trigger shutdown on verification failure, and will mutate the state parameter when verification fails, so callers understand the cross-layer behavior and non-obvious side effects involved.Source: Coding guidelines
3098-3099: 🏗️ Heavy liftAvoid doing Spark batch verification inside the disk-flush lock scope.
FlushStateToDiskholdscs_mainandcs_LastBlockFilefrom Line 3051, so Line 3098 can now run expensive Spark proof verification while blocking validation and block-file operations. Consider restructuring this into a smaller scoped lock/retry boundary: compute that a durable write is needed, release the locks forVerifyPendingSparkBatch, then reacquire/recompute before writing so the fail-closed guarantee is preserved without extending the critical section.As per coding guidelines, “Scope lock regions carefully with braces to minimize critical sections,” and the PR context identifies Spark proof verification as expensive.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/validation.cpp` around lines 3098 - 3099, The VerifyPendingSparkBatch call at line 3098 is executing while holding the cs_main and cs_LastBlockFile locks (acquired at line 3051), which blocks validation and block-file operations during expensive Spark proof verification. Restructure the FlushStateToDisk function to move the VerifyPendingSparkBatch call outside the critical section: first determine whether a durable write is needed with the locks held, then release both cs_main and cs_LastBlockFile, execute VerifyPendingSparkBatch outside the locks, and finally reacquire the locks before performing the actual disk write operations. Ensure the fail-closed guarantee is preserved by recomputing conditions after reacquiring the locks to verify the state has not changed.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/batchproof_container.cpp`:
- Around line 84-86: The LogPrintf call with the message "Spark batch
verification failed." in the conditional block checking !passed is missing a
trailing newline character, while other LogPrintf calls in the same function
include it for consistent formatting. Add a `\n` character at the end of the log
message string in the LogPrintf call to match the formatting convention used
elsewhere in the function.
In `@src/validation.cpp`:
- Around line 3814-3818: The batch proof verification check via
batchProofContainer->verify() is occurring too late in the validation flow.
Currently it happens after ActivateBestChainStep() returns, which means
ConnectBlock() has already written to pblocktree and ConnectTip() has already
updated wallet/listener state. Move the batchProofContainer->verify() check to
occur before these side effects happen, either by moving it earlier in the code
flow before ActivateBestChainStep() is called, or by restructuring the logic to
defer the durable index writes and state updates until after batch verification
succeeds. This ensures that if verification fails, no side effects like
pblocktree writes or wallet state changes have already been applied.
---
Nitpick comments:
In `@src/batchproof_container.cpp`:
- Around line 28-35: The verify() method in BatchProofContainer has a
non-obvious conditional that returns true when fCollectProofs is true, which
skips batch verification. Add a clarifying comment above or within the if
(fCollectProofs) block explaining that this early return is intentional because
when proofs are still being collected, there is nothing finalized to verify yet,
so the function returns success without performing the actual batch verification
that happens in the batch_spark() call.
In `@src/validation.cpp`:
- Around line 2237-2246: Add Doxygen documentation to the
VerifyPendingSparkBatch function that documents its contract and side effects.
Include tags for `@param` describing the CValidationState reference and reason
string parameter, `@return` describing the boolean return value indicating success
or failure, and `@pre` describing caller preconditions. The documentation should
clearly capture that this function verifies pending Spark proofs, may abort the
node and trigger shutdown on verification failure, and will mutate the state
parameter when verification fails, so callers understand the cross-layer
behavior and non-obvious side effects involved.
- Around line 3098-3099: The VerifyPendingSparkBatch call at line 3098 is
executing while holding the cs_main and cs_LastBlockFile locks (acquired at line
3051), which blocks validation and block-file operations during expensive Spark
proof verification. Restructure the FlushStateToDisk function to move the
VerifyPendingSparkBatch call outside the critical section: first determine
whether a durable write is needed with the locks held, then release both cs_main
and cs_LastBlockFile, execute VerifyPendingSparkBatch outside the locks, and
finally reacquire the locks before performing the actual disk write operations.
Ensure the fail-closed guarantee is preserved by recomputing conditions after
reacquiring the locks to verify the state has not changed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5a7a95f0-1604-4511-a070-ca6ac125bb75
📒 Files selected for processing (5)
src/batchproof_container.cppsrc/batchproof_container.hsrc/init.cppsrc/validation.cppsrc/validation.h
| if (!passed) { | ||
| LogPrintf("Spark batch verification failed."); | ||
| throw std::invalid_argument("Spark batch verification failed, please run Firo with -reindex -batching=0"); | ||
| return false; |
There was a problem hiding this comment.
Missing newline in log message.
The failure log message on line 85 is missing a trailing \n, unlike the other LogPrintf calls in this function (lines 55 and 90).
🔧 Suggested fix
if (!passed) {
- LogPrintf("Spark batch verification failed.");
+ LogPrintf("Spark batch verification failed.\n");
return false;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (!passed) { | |
| LogPrintf("Spark batch verification failed."); | |
| throw std::invalid_argument("Spark batch verification failed, please run Firo with -reindex -batching=0"); | |
| return false; | |
| if (!passed) { | |
| LogPrintf("Spark batch verification failed.\n"); | |
| return false; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/batchproof_container.cpp` around lines 84 - 86, The LogPrintf call with
the message "Spark batch verification failed." in the conditional block checking
!passed is missing a trailing newline character, while other LogPrintf calls in
the same function include it for consistent formatting. Add a `\n` character at
the end of the log message string in the LogPrintf call to match the formatting
convention used elsewhere in the function.
| if (!batchProofContainer->verify()) { | ||
| return AbortNode(state, | ||
| "Spark batch verification failed", | ||
| _("Spark batch verification failed. Please restart with -reindex -batching=0 to identify the invalid Spark spend.")); | ||
| } |
There was a problem hiding this comment.
Move the failure check before post-connect side effects escape.
Line 3814 verifies the batch only after ActivateBestChainStep() returns; by then ConnectBlock() has already issued pblocktree index writes, and ConnectTip()/this scope have already updated tip-dependent wallet/listener state. If verify() fails here, the node aborts but those side effects can already reflect the invalid Spark batch. Move the checked batch boundary before those durable/index and notification side effects, or defer those side effects until the batch result is known.
Based on PR objectives, pending Spark batches must be verified before validation progress is allowed to advance durably.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/validation.cpp` around lines 3814 - 3818, The batch proof verification
check via batchProofContainer->verify() is occurring too late in the validation
flow. Currently it happens after ActivateBestChainStep() returns, which means
ConnectBlock() has already written to pblocktree and ConnectTip() has already
updated wallet/listener state. Move the batchProofContainer->verify() check to
occur before these side effects happen, either by moving it earlier in the code
flow before ActivateBestChainStep() is called, or by restructuring the logic to
defer the durable index writes and state updates until after batch verification
succeeds. This ensures that if verification fails, no side effects like
pblocktree writes or wallet state changes have already been applied.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4d9f1a1b0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| bool fPeriodicFlush = mode == FLUSH_STATE_PERIODIC && nNow > nLastFlush + (int64_t)DATABASE_FLUSH_INTERVAL * 1000000; | ||
| // Combine all conditions that result in a full cache flush. | ||
| bool fDoFullFlush = (mode == FLUSH_STATE_ALWAYS) || fCacheLarge || fCacheCritical || fPeriodicFlush || fFlushForPrune; | ||
| if ((fDoFullFlush || fPeriodicWrite) && !VerifyPendingSparkBatch(state, "flushing block index or chainstate")) |
There was a problem hiding this comment.
Defer Spark batch checks until the tip is active
When this guard is reached from ConnectTip() via FlushStateToDisk(state, FLUSH_STATE_IF_NEEDED), it runs before UpdateTip(pindexNew, chainparams) advances chainActive (checked src/validation.cpp:3421 and :3430). The batch verifier rebuilds cover sets through CSparkState::GetCoinSet(), which caps them using the current chainActive.Height() (src/spark/state.cpp:1478-1483), so a cache-critical flush while connecting an old Spark block can verify against a one-block-stale active height and reject a valid spend whose cover-set reference is the newest confirmed set for the block being connected. This makes valid reindex/IBD runs abort depending on cache pressure; run the forced batch check only after the new tip is active, or rebuild the cover set from the block context being flushed.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| bool BatchProofContainer::verify_pending() { | ||
| finalize(); |
There was a problem hiding this comment.
Don't promote proofs from failed block connections
verify_pending() can be called from shutdown or later flush paths after ConnectBlock() has returned early. Since ConnectBlock() only calls batchProofContainer->finalize() at its successful end (src/validation.cpp:2971), an old block that collected Spark spends and then failed a later validation check leaves those spends in tempSparkTransactions; this unconditional finalize() promotes proofs from the rejected block into the global pending batch. A subsequent forced verification can then abort the node because of a proof that was never part of an accepted block, so stale temp proofs should be cleared on failed connections or only finalized by the successful connect path.
Useful? React with 👍 / 👎.
| bool fPeriodicFlush = mode == FLUSH_STATE_PERIODIC && nNow > nLastFlush + (int64_t)DATABASE_FLUSH_INTERVAL * 1000000; | ||
| // Combine all conditions that result in a full cache flush. | ||
| bool fDoFullFlush = (mode == FLUSH_STATE_ALWAYS) || fCacheLarge || fCacheCritical || fPeriodicFlush || fFlushForPrune; | ||
| if ((fDoFullFlush || fPeriodicWrite) && !VerifyPendingSparkBatch(state, "flushing block index or chainstate")) |
There was a problem hiding this comment.
Move verification before writing the pruned flag
When pruning is triggered with a pending Spark batch, this guard runs after the pruning branch has already persisted prunedblockfiles via pblocktree->WriteFlag("prunedblockfiles", true) and set fHavePruned (src/validation.cpp:3065-3069). If the Spark batch then fails, the node still leaves that durable flag behind even though it stops before unlinking files or flushing the verified state; on the next startup, fHavePruned && !fPruneMode forces the user into a reindex path (src/init.cpp:1985-1989). To keep the fail-closed boundary from writing validation metadata first, verify the pending batch before the pruning code writes the block-tree flag.
Useful? React with 👍 / 👎.
PR intention
Make deferred Spark batch proof verification fail closed before validation state is persisted as accepted.
This PR is the first security-first step toward improving Spark-era reindex performance. It keeps Spark proof batching available for old blocks, where doing every proof immediately is expensive, while ensuring that any deferred batch must be verified successfully before the node writes durable validation state or marks reindexing complete.
Why this is desirable
Reindexing becomes noticeably slower once Spark-era blocks are reached because Spark spend validation has expensive cover-set and proof-verification work. The existing batching mode is intended to reduce that cost by collecting Spark spend proofs from old blocks and verifying them together instead of doing every proof immediately block by block.
That performance direction is desirable, but the deferred-verification boundary needs to be explicit and fail-closed before further optimization. If a node is allowed to continue writing chainstate, block-index state, or the reindex completion flag while Spark proofs are still pending, then the local durable state can get ahead of the proof-verification result. Even if a later verification failure would eventually stop the node, that is the wrong security shape for consensus-sensitive validation: pending privacy-spend proofs should be treated as unaccepted until the deferred batch has passed.
This PR makes that invariant visible in the validation layer so future reindex-speed work can build on it safely.
Existing behavior this avoids
In the current codebase, when
-batchingis enabled and the node is processing blocks older than one day, Spark spends may be collected into the batch container instead of being fully verified immediately.ConnectBlock()finalizes the per-block collection, andActivateBestChain()only runs the batch verifier when it leaves the old-block batching mode.That means a long reindex or initial sync can accumulate pending Spark proof work while other validation paths continue. Periodic/full flushes and the reindex completion path did not explicitly require those pending Spark proofs to be verified first. The old batch verifier also reported failure by throwing from
batch_spark(), which made failure handling depend on where verification happened rather than on a clear validation-state result.The undesirable part is not batching itself. Batching is useful for performance. The undesirable part is that deferred proof verification was not clearly coupled to the persistence boundaries that make validation progress durable.
Code changes brief
BatchProofContainer::verify_pending()to finalize any collected Spark spends and verify the pending batch explicitly.VerifyPendingSparkBatch()as a validation-layer guard that aborts the node with a recovery hint if pending Spark proofs fail.Security and correctness notes
This PR does not change Spark consensus rules, proof construction, cover-set selection, or the underlying Spark verifier. It only changes when a deferred batch must be forced to a verification result and how that result is propagated.
The failure mode is intentionally fail-closed: if a pending Spark batch fails verification, the node aborts instead of persisting validation progress past that point. The recovery hint remains to restart with
-reindex -batching=0, which forces block-by-block verification and helps identify the invalid Spark spend.The PR also keeps failed batch contents in memory rather than clearing them, so a failure does not silently discard the evidence needed by the current process to remain stopped at the failed deferred-verification boundary.
Intended follow-up speed optimization
After this fail-closed boundary is reviewed and merged, the real reindex-speed work should focus on reducing repeated Spark anonymity-set history work while preserving the same verification result.
The likely next step is to introduce a validation-owned, in-memory Spark batch context/cache for old-block reindex and IBD paths. Instead of having each Spark spend independently walk block history and reconstruct the same cover-set metadata, the batch path should collect the unique Spark cover-set references needed by the pending spends, keyed by the coin group and the referenced accumulator/block hash. The node can then resolve each unique reference once, derive the cover-set size and representation from chain state once, and reuse that derived metadata for all spends in the pending batch.
The batch verifier should likewise avoid rebuilding the same full cover sets repeatedly. For each unique group/reference needed by the pending batch, it should construct or fetch the corresponding derived cover-set data once, then pass those deterministic chain-derived results into the existing Spark proof verifier for the collected spends.
Important security constraints for that follow-up:
-batching=0as the conservative diagnostic path that verifies spends without the deferred batch optimization.That follow-up is expected to address the practical slowdown around Spark activation much more directly than this PR. This PR intentionally prepares the correctness boundary first so the later cache/de-duplication work can optimize without weakening verification integrity.
Scope
This is deliberately minimal. It does not add the larger cover-set metadata cache or broader Spark reindex acceleration yet. Those optimizations should be easier to reason about after this fail-closed boundary is in place.
Validation
git diff --check origin/master...HEAD -- src/batchproof_container.cpp src/batchproof_container.h src/init.cpp src/validation.cpp src/validation.hcmakeandninjaare not available on PATH in this environment.