fix(execution-verifier): persist verified block progress on exit #145
Open
piersy wants to merge 1 commit into
The execution verifier was entering a CrashLoopBackOff in Kubernetes because it would complete a short block range faster than the 10-second persistence interval, then exit without saving progress. On restart it would read the same stale persisted block, re-process the same range, and exit again in an infinite loop.

Changes:
- Persist the verified block tracker to the state file on all exit paths (normal completion, task error, and task panic), not just on the background timer. The happy path propagates persist errors; the error paths use best-effort persistence to avoid masking the original error.
- Clone state_file and tracker before they are moved into spawned task closures so they remain available at the exit points.
- Improve the startup log to print the end block number when defined, or indicate that the verifier is following the head, making it easier to diagnose range vs head-following mode from pod logs.
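The first change can be illustrated with a minimal, std-only sketch. It is not the PR's code: `run_and_persist` and the closure-based `work` parameter are hypothetical, the `persist_verified_block` signature here is a simplified stand-in, and the panic path is omitted. It only shows the asymmetry described above, where the happy path propagates a persist failure and the error path persists on a best-effort basis.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the verifier's state-file write.
fn persist_verified_block(block: u64, state_file: Option<&Path>) -> io::Result<()> {
    match state_file {
        Some(path) => fs::write(path, block.to_string()),
        None => Ok(()), // no state file configured: nothing to persist
    }
}

/// Runs `work`, then persists the last verified block on both exit paths.
fn run_and_persist<F>(work: F, state_file: Option<&Path>) -> Result<u64, String>
where
    F: FnOnce(&mut u64) -> Result<(), String>,
{
    let mut last_verified = 0u64;
    match work(&mut last_verified) {
        Ok(()) => {
            // Happy path: a failed persist is a real failure, so propagate it.
            persist_verified_block(last_verified, state_file).map_err(|e| e.to_string())?;
            Ok(last_verified)
        }
        Err(original) => {
            // Error path: best-effort persist that never masks the original error.
            if let Err(e) = persist_verified_block(last_verified, state_file) {
                eprintln!("failed to persist verified block: {}", e);
            }
            Err(original)
        }
    }
}
```

With this shape, a run that finishes before any periodic timer fires still leaves its progress in the state file, so a restart does not re-read a stale block.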
d1f465d to
f5fdac6
Compare
karlb reviewed May 20, 2026
    Some(end) => tracing::info!(
        start_block_number = start_block,
        end_block_number = end,
        "Using start-block with end-block"
Sounds a bit weird. Maybe "using fixed block range"?
Comment on lines 307 to +308
    verified_block_store_task.abort();
    persist_verified_block(tracker, cli.state_file.as_ref()).await?;
Will this give us a clean abort without joining verified_block_store_task? Should we rather do
Suggested change
    - verified_block_store_task.abort();
    - persist_verified_block(tracker, cli.state_file.as_ref()).await?;
    + cancel_token.cancel();
    + verified_block_store_task.await?;
    + persist_verified_block(tracker, cli.state_file.as_ref()).await?;
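The ordering in this suggestion (signal cancellation, wait for the background task to finish, then do the final persist) can be sketched without an async runtime. The sketch below is an analogy only, under stated assumptions: it uses std threads and an `AtomicBool` in place of the PR's spawned task and `cancel_token`, a `Vec` in place of the state file, and the function name `shutdown_and_persist` is hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Signals the background persister to stop, waits for it to finish,
/// then performs one final persist. Returns every value that was persisted.
fn shutdown_and_persist(final_block: u64) -> Vec<u64> {
    let cancelled = Arc::new(AtomicBool::new(false));
    let tracker = Arc::new(AtomicU64::new(0));
    // Stands in for the state file: records each persisted value in order.
    let persisted: Arc<Mutex<Vec<u64>>> = Arc::new(Mutex::new(Vec::new()));

    let background = {
        let cancelled = Arc::clone(&cancelled);
        let tracker = Arc::clone(&tracker);
        let persisted = Arc::clone(&persisted);
        thread::spawn(move || {
            // Periodic persistence loop, checked against the stop flag.
            while !cancelled.load(Ordering::SeqCst) {
                persisted.lock().unwrap().push(tracker.load(Ordering::SeqCst));
                thread::sleep(Duration::from_millis(5));
            }
        })
    };

    // The main work finishes and records its final verified block.
    tracker.store(final_block, Ordering::SeqCst);

    // Cooperative shutdown: request stop, then join, so no background write
    // can race with or land after the final persist below.
    cancelled.store(true, Ordering::SeqCst);
    background.join().expect("background persister panicked");

    persisted.lock().unwrap().push(tracker.load(Ordering::SeqCst));
    let all_writes = persisted.lock().unwrap().clone();
    all_writes
}
```

Because the join completes before the final write, the last recorded value is always the final block, regardless of how many periodic writes happened first.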
    let concurrency_handle = verify_new_heads_concurrency.clone();
    handles.spawn({
        let cancel_token = cancel_token.clone();
        let cloned_cancel_token = cancel_token.clone();
Using the name cancel_token_clone would be consistent with the naming in this file. We already had cancel_token_clone and state_file_clone above.
An execution verifier instance was entering a CrashLoopBackOff in Kubernetes because it would complete a short block range faster than the 10-second persistence interval, then exit without saving progress. On restart it would read the same stale persisted block, re-process the same range, and exit again in an infinite loop.
This PR won't stop the verifier from getting stuck in CrashLoopBackOff, but at least the verifier will output a log that makes it clear that it has finished processing its requested block range.
This was the alert - https://clabsco.slack.com/archives/C04NWTCC810/p1774433563332659
Changes: