Skip to content

Add ledger arena for warp sync#1650

Open
justinfrevert wants to merge 23 commits into
mainfrom
warp-ledger-sync
Open

Add ledger arena for warp sync#1650
justinfrevert wants to merge 23 commits into
mainfrom
warp-ledger-sync

Conversation

@justinfrevert

@justinfrevert justinfrevert commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Overview

Part 1 / ~3 for #1648. Prepares ledger arena storage for warp sync. Adds ledger sync process and handlers for ledger data transfer. (This is covered under the upcoming Q3 statement of work regarding sync performance).

🗹 TODO before merging

  • Ready

📌 Submission Checklist

  • All commits are signed off (git commit -s) for the DCO
  • Changes are backward-compatible (or flagged if breaking)
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file, or skipped for this reason:
  • If the changes introduce a new feature, I have bumped the node minor version
  • Update documentation (if relevant)
  • Updated AGENTS.md if build commands, architecture, or workflows changed
  • No new todos introduced

🧪 Testing Evidence

Please describe any additional testing aside from CI:

  • Additional tests are provided (if possible)

🔱 Fork Strategy

  • Node Runtime Update
  • Node Client Update
  • Other:
  • N/A

Links

justinfrevert and others added 5 commits June 5, 2026 13:27
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
@justinfrevert justinfrevert marked this pull request as ready for review June 10, 2026 04:59
@justinfrevert justinfrevert requested a review from a team as a code owner June 10, 2026 04:59
justinfrevert and others added 7 commits June 9, 2026 22:41
…er-state version

Warp serialize (server), import (client), and genesis-arena-init hardcoded ledger_9
(LedgerState v16). A fresh node syncing onto a network governed by an older ledger
version (e.g. a real devnet whose genesis+tip arena is v13/ledger_8) then panics:
genesis-init can't deserialize the v13 genesis_state, and serve/recover would mismatch.

Parse the 'ledger-state[vNN]' tag from the StateKey / genesis_state and dispatch to the
matching compiled-in module (v5->ledger_7, v13->ledger_8, v16->ledger_9). Resolves the
deferred per-version dispatch (review finding #5).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
(cherry picked from commit 85db776)
The recovery monitor sourced peers from SyncingService::peers_info(), which
chain_sync.restart() empties on benign post-warp UnknownParent announcements,
stalling 1000-scale recovery with "no peers" while libp2p connections were
still up. Source candidate peers from the network layer (connected + reserved
peers, which survive restarts), falling back to sync peers only if none.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
(cherry picked from commit 930aa4e)
…stead of dropping them

A warp-synced node never progressed past its target on a real network
(observed: pinned at the warp target for 9+ hours against a devnet fork,
~8,400 sync req/s). Three root causes, all fixed here:

1. GatedBlockImport returned Ok(ImportResult::MissingState) for gated
   blocks. substrate silently drops such blocks while chain_sync's
   best_queued_number stays advanced, so peer ancestor searches can only
   match genesis (the warp gap has no headers) and "common too far
   behind" re-arms the search forever; peers stuck in AncestorSearch
   ignore announcements, starving tip AND gap sync permanently.
   (Returning Err instead is no better: each restart re-issues an
   identical block request and BlockRequestHandler reputation-bans the
   node — "Same block request multiple times".)
   Fix: defer by awaiting the recovery gate inside import_block. This
   cannot deadlock under the new gate scope (below): a gated block can
   only exist after the state-sync target imported, and recovery uses
   only the client + request-response network, never the import queue.

2. The gate covered every !with_state() block, including gap-sync
   (block-history) blocks, which import with skip_execution and never
   touch the arena. Gate now defers only blocks that would execute:
   StateAction::Execute, or ExecuteIfPossible with parent state present.

3. The monitor re-armed the gate and re-downloaded the whole arena on
   every restart of a post-warp node, because chain_sync reports an
   active gap sync as warp_sync: Some(DownloadingBlocks). The monitor
   now decides by arena content at startup via the new
   midnight_node_ledger::has_ledger_state (cheap root lookup,
   per-version dispatch): present -> skip, absent -> recover.

Validated end-to-end against a local-environment fork of a real devnet
ledger_8 snapshot (6 validators, mock authorities): fresh --sync warp
node went genesis -> fully synced (state + verified 7.3MB arena + full
554k block history) in ~83s, follows the tip with finality, container
restart does not re-gate; 0 import errors, 0 NoLedgerState, 0 peer bans.

Assisted-by: Claude:claude-fable-5
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
(cherry picked from commit 9332cf5ed9f471a1e29a3cfceea480e04f25f17b)
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
Signed-off-by: justinfrevert <81839854+justinfrevert@users.noreply.github.com>
Comment thread ledger/src/versions/common/mod.rs Outdated
Comment thread ledger/src/versions/common/mod.rs
gilescope
gilescope previously approved these changes Jun 11, 2026
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
justinfrevert and others added 2 commits June 12, 2026 05:35
Signed-off-by: Justin Frevert <justinfrevert@gmail.com>
@justinfrevert justinfrevert enabled auto-merge June 12, 2026 18:39

/// Arm the gate: the warp path was taken, so authoring + import must wait for verification.
pub fn arm(&self) {
self.recovery_pending.store(true, Ordering::Release);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it will simplify the code, if we will be persisting in the database this flag, so we may have support for restart of the node & the same location of data (fetched data & flag), also AuxData::insert_aux supports transactional operations if we put multiple keys together

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node/src/warp_ledger_sync/monitor.rs checks the relevant state key on boot, so we can resume partial states.

Comment thread node/src/warp_ledger_sync/server.rs Outdated
);

let blob = Arc::new(blob);
self.cache = Some((target, blob.clone()));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the average blob size there? Are we safe to store that in-memory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pulled it for mainnet, and it's a few hundred MiB.

Comment thread node/src/warp_ledger_sync/server.rs Outdated

fn handle_request(&mut self, payload: &[u8]) -> Result<Vec<u8>, HandleError> {
let req = LedgerSyncRequest::<B::Hash>::decode(&mut &payload[..])?;
let blob = self.blob_for(req.target_hash)?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it expensive to compute the blob for specific hash? Maybe worth to narrow scope of available hashes, like target_hash will be available only for blocks that rotate the session, or even only for the last session bundary that is finalized?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GRANDPA should already follow that path of session-boundary hops, but maybe to not make it dependent on that, there is a chance of running GRANDPA warp-sync first, or retrieve the session boundary block via runtime api like Session::current_index

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having that also, maybe we can adjust the protocol to not target a block hash but rather a session index?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea for protection. I think with the scope, maybe we can work on this separately. #1723.

}
}

/// Wraps the node's inner [`SyncOracle`] (the `SyncingService`) so AURA reports "still syncing"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AURA has internal logic to skip block authorship when it is in major_sync, it is located in the Verifier that is used for block production & block import:
https://github.com/paritytech/polkadot-sdk/blob/8ae2e6a47aef16e392b4a951b7165c87d9f1e75b/substrate/client/consensus/aura/src/import_queue.rs#L157
maybe we can check if that works when warp is active, or set the flag if that is not a case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do use the internal aura check ahead of our check on warp sync status

inner.is_major_syncing() || gate.ledger_recovery_in_progress()

Does that make sense for now?

@datadog-official

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants