Align corrupted-ledger join regression with snapshot-start behavior #7925
Merged
achamayou merged 3 commits intoJun 5, 2026
Copilot AI changed the title from "[WIP] Update regression test for issue #6612 in nodes_test" to "Align corrupted-ledger join regression with snapshot-start behavior" on Jun 5, 2026
achamayou approved these changes on Jun 5, 2026
dc0dbb6 into copilot/add-testcase-for-issue-6612
17 of 19 checks passed
The regression test for corrupted copied ledger chunks was asserting the wrong outcome for the issue scenario. For join-from-snapshot, a truncated older uncommitted ledger file should be ignored/truncated during recovery, and the node should still join.
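The kind of truncation this scenario depends on can be reproduced with the standard library alone. The sketch below is not the code in tests/reconfiguration.py: the helper name `truncate_file`, the `keep_fraction` parameter, and the file name `ledger_11` are assumptions made for illustration.

```python
import os
import tempfile


def truncate_file(path: str, keep_fraction: float = 0.5) -> int:
    """Cut a file down to a fraction of its current size.

    This simulates a ledger chunk that was only partially written.
    Returns the new size in bytes.
    """
    size = os.path.getsize(path)
    new_size = int(size * keep_fraction)
    os.truncate(path, new_size)
    return new_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ledger_dir:
        # Hypothetical file name standing in for an uncommitted ledger chunk.
        chunk = os.path.join(ledger_dir, "ledger_11")
        with open(chunk, "wb") as f:
            f.write(b"\x00" * 1024)
        print("truncated to", truncate_file(chunk), "bytes")
```

A test that wants a reproducible failure mode would probably fix the cut point rather than derive it from the file size, so the same entry is always split.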
Snapshot-based join setup (issue-accurate path)
test_add_node_with_corrupted_ledger now creates/uses a committed snapshot before preparing the joiner.
from_snapshot=True with fetch_recent_snapshot=False and explicit snapshots_dir.
Corruption target and startup expectation
Lifecycle cleanup + optional path signal
startup_seqno != 0), then retires/stops it to keep the network clean.
Original prompt
Modify the existing open PR branch copilot/add-testcase-for-issue-6612 for PR #7666 in microsoft/CCF.
Context:
nodes_test in VMSS Virtual C failed with AssertionError('Node 13 with corrupted ledger unexpectedly started').
tests/reconfiguration.py as test_add_node_with_corrupted_ledger.
Required change:
test_add_node_with_corrupted_ledger so it matches issue "cchost crashes in case a corrupt ledger file is found on a node that is joining the network" #6612: the node should join successfully from a snapshot despite a corrupted/truncated ledger file in its copied ledger directory.
from_snapshot=True, fetch_recent_snapshot=False where appropriate).
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The failing job is nodes_test in the VMSS Virtual C bucket, run by .github/workflows/ci.yml in the step:
./tests.sh --timeout 360 --output-on-failure -L bucket_c
Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/.github/workflows/ci.yml
The actionable failure is in test_add_node_with_corrupted_ledger in tests/reconfiguration.py, which ends with:
AssertionError('Node 13 with corrupted ledger unexpectedly started')
Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/tests/reconfiguration.py
What the logs show
The corrupted-ledger regression test expects a joiner to fail when replaying a deliberately truncated ledger, but the node started instead. The job log ends with:
RuntimeError: ["Failure in reconfiguration: AssertionError('Node 13 with corrupted ledger unexpectedly started')"]
This means the join path is no longer rejecting a corrupted copied ledger during startup.
Root cause
The regression test sets up a join node with:
from_snapshot=False
fetch_recent_snapshot=False
and truncates the latest uncommitted ledger chunk before network.run_join_node(...).
That should force the new node to replay the copied ledger and fail. Instead, the joiner gets far enough to appear healthy to the test harness.
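Selecting "the latest uncommitted ledger chunk" can be sketched as below. This is an illustration only, not the harness's actual logic. It assumes that committed chunks carry a `.committed` suffix and that every chunk name begins with `ledger_` followed by its starting sequence number. The function name `latest_uncommitted_chunk` is hypothetical.

```python
import re
from typing import List, Optional

# Assumed naming convention: "ledger_<start seqno>...", with committed
# chunks ending in ".committed". Adjust if the real layout differs.
_CHUNK_RE = re.compile(r"^ledger_(\d+)")


def latest_uncommitted_chunk(file_names: List[str]) -> Optional[str]:
    """Return the uncommitted chunk with the highest start seqno, if any."""
    candidates = []
    for name in file_names:
        if name.endswith(".committed"):
            continue  # committed chunks are never the corruption target here
        match = _CHUNK_RE.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Tuples compare by start seqno first, so max() picks the newest chunk.
    return max(candidates)[1]
```

For a directory listing such as `["ledger_1-10.committed", "ledger_11-20", "ledger_21"]`, the helper picks `ledger_21`.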
Given the recent join/startup code, the most likely issue is that truncated ledger recovery is no longer treated as a fatal startup failure in the host/start path, so the node can enter the join loop without fully validating the copied ledger first.
The relevant startup/join logic is in:
src/node/node_state.h
initiate_join_unsafe
Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/src/node/node_state.h
There is also a raft assertion seen elsewhere in the same job:
src/consensus/aft/raft.h
Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/src/consensus/aft/raft.h
But that is not the failure that made the test fail. The failing assertion is specifically the corrupted-ledger join unexpectedly succeeding.
Solution
Make corrupted ledger replay a hard startup failure for joining nodes before they can be considered successfully started.
Recommended fix
In the host/node startup path, when ledger deserialisation detects a truncated or malformed ledger entry during recovery/replay, convert that exception into a fatal startup error and terminate the node process cleanly.
Concretely:
Code suggestion
The desired behavior should mirror the existing fatal handling already used for early join TLS/certificate failures in src/node/node_state.h around:
NodeState::initiate_join_unsafe
This pull request was created from Copilot chat.