Add ledger arena for warp sync#1650
Conversation
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
…er-state version Warp serialize (server), import (client), and genesis-arena-init hardcoded ledger_9 (LedgerState v16). A fresh node syncing onto a network governed by an older ledger version (e.g. a real devnet whose genesis+tip arena is v13/ledger_8) then panics: genesis-init can't deserialize the v13 genesis_state, and serve/recover would mismatch. Parse the 'ledger-state[vNN]' tag from the StateKey / genesis_state and dispatch to the matching compiled-in module (v5->ledger_7, v13->ledger_8, v16->ledger_9). Resolves the deferred per-version dispatch (review finding #5). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Justin Frevert <justinfrevert@gmail.com> (cherry picked from commit 85db776)
The recovery monitor sourced peers from SyncingService::peers_info(), which chain_sync.restart() empties on benign post-warp UnknownParent announcements, stalling 1000-scale recovery with "no peers" while libp2p connections were still up. Source candidate peers from the network layer (connected + reserved peers, which survive restarts), falling back to sync peers only if none. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Justin Frevert <justinfrevert@gmail.com> (cherry picked from commit 930aa4e)
…stead of dropping them A warp-synced node never progressed past its target on a real network (observed: pinned at the warp target for 9+ hours against a devnet fork, ~8,400 sync req/s). Three root causes, all fixed here: 1. GatedBlockImport returned Ok(ImportResult::MissingState) for gated blocks. substrate silently drops such blocks while chain_sync's best_queued_number stays advanced, so peer ancestor searches can only match genesis (the warp gap has no headers) and "common too far behind" re-arms the search forever; peers stuck in AncestorSearch ignore announcements, starving tip AND gap sync permanently. (Returning Err instead is no better: each restart re-issues an identical block request and BlockRequestHandler reputation-bans the node — "Same block request multiple times".) Fix: defer by awaiting the recovery gate inside import_block. This cannot deadlock under the new gate scope (below): a gated block can only exist after the state-sync target imported, and recovery uses only the client + request-response network, never the import queue. 2. The gate covered every !with_state() block, including gap-sync (block-history) blocks, which import with skip_execution and never touch the arena. Gate now defers only blocks that would execute: StateAction::Execute, or ExecuteIfPossible with parent state present. 3. The monitor re-armed the gate and re-downloaded the whole arena on every restart of a post-warp node, because chain_sync reports an active gap sync as warp_sync: Some(DownloadingBlocks). The monitor now decides by arena content at startup via the new midnight_node_ledger::has_ledger_state (cheap root lookup, per-version dispatch): present -> skip, absent -> recover. Validated end-to-end against a local-environment fork of a real devnet ledger_8 snapshot (6 validators, mock authorities): fresh --sync warp node went genesis -> fully synced (state + verified 7.3MB arena + full 554k block history) in ~83s, follows the tip with finality, container restart does not re-gate; 0 import errors, 0 NoLedgerState, 0 peer bans. Assisted-by: Claude:claude-fable-5 Signed-off-by: Justin Frevert <justinfrevert@gmail.com> (cherry picked from commit 9332cf5ed9f471a1e29a3cfceea480e04f25f17b)
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: justinfrevert <81839854+justinfrevert@users.noreply.github.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
|
|
||
| /// Arm the gate: the warp path was taken, so authoring + import must wait for verification. | ||
| pub fn arm(&self) { | ||
| self.recovery_pending.store(true, Ordering::Release); |
There was a problem hiding this comment.
maybe it will simplify the code, if we will be persisting in the database this flag, so we may have support for restart of the node & the same location of data (fetched data & flag), also AuxData::insert_aux supports transactional operations if we put multiple keys together
There was a problem hiding this comment.
node/src/warp_ledger_sync/monitor.rs checks the relevant state key on boot, so we can resume partial states.
| ); | ||
|
|
||
| let blob = Arc::new(blob); | ||
| self.cache = Some((target, blob.clone())); |
There was a problem hiding this comment.
what is the average blob size there? Are we safe to store that in-memory?
There was a problem hiding this comment.
I just pulled it for mainnet, and it's a few hundred MiB.
|
|
||
| fn handle_request(&mut self, payload: &[u8]) -> Result<Vec<u8>, HandleError> { | ||
| let req = LedgerSyncRequest::<B::Hash>::decode(&mut &payload[..])?; | ||
| let blob = self.blob_for(req.target_hash)?; |
There was a problem hiding this comment.
is it expensive to compute the blob for specific hash? Maybe worth to narrow scope of available hashes, like target_hash will be available only for blocks that rotate the session, or even only for the last session bundary that is finalized?
There was a problem hiding this comment.
GRANDPA should already follow that path of session-boundary hops, but maybe to not make it dependent on that, there is a chance of running GRANDPA warp-sync first, or retrieve the session boundary block via runtime api like Session::current_index
There was a problem hiding this comment.
having that also, maybe we can adjust the protocol to not target a block hash but rather a session index?
There was a problem hiding this comment.
It's a good idea for protection. I think with the scope, maybe we can work on this separately. #1723.
| } | ||
| } | ||
|
|
||
| /// Wraps the node's inner [`SyncOracle`] (the `SyncingService`) so AURA reports "still syncing" |
There was a problem hiding this comment.
AURA has internal logic to skip block authorship when it is in major_sync, it is located in the Verifier that is used for block production & block import:
https://github.com/paritytech/polkadot-sdk/blob/8ae2e6a47aef16e392b4a951b7165c87d9f1e75b/substrate/client/consensus/aura/src/import_queue.rs#L157
maybe we can check if that works when warp is active, or set the flag if that is not a case?
There was a problem hiding this comment.
We do use the internal aura check ahead of our check on warp sync status
inner.is_major_syncing() || gate.ledger_recovery_in_progress()Does that make sense for now?
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
This comment has been minimized.
This comment has been minimized.
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Overview
Part 1 / ~3 for #1648. Prepares ledger arena storage for warp sync. Adds ledger sync process and handlers for ledger data transfer. (This is covered under the upcoming Q3 statement of work regarding sync performance).
🗹 TODO before merging
📌 Submission Checklist
git commit -s) for the DCO🧪 Testing Evidence
Please describe any additional testing aside from CI:
🔱 Fork Strategy
Links