Skip to content

Align corrupted-ledger join regression with snapshot-start behavior#7925

Merged
achamayou merged 3 commits into
copilot/add-testcase-for-issue-6612from
copilot/copilotadd-testcase-for-issue-6612
Jun 5, 2026
Merged

Align corrupted-ledger join regression with snapshot-start behavior#7925
achamayou merged 3 commits into
copilot/add-testcase-for-issue-6612from
copilot/copilotadd-testcase-for-issue-6612

Conversation

Copilot AI commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

The regression test for corrupted copied ledger chunks was asserting the wrong outcome for the issue scenario. For join-from-snapshot, a truncated older uncommitted ledger file should be ignored/truncated during recovery, and the node should still join.

  • Snapshot-based join setup (issue-accurate path)

    • test_add_node_with_corrupted_ledger now creates/uses a committed snapshot before preparing the joiner.
    • Join setup now uses from_snapshot=True with fetch_recent_snapshot=False and explicit snapshots_dir.
  • Corruption target and startup expectation

    • The test now truncates an uncommitted ledger file in the joiner’s main copied ledger directory (preferably one older than the snapshot boundary).
    • Expected behavior changed from “join must fail” to “join must succeed”.
  • Lifecycle cleanup + optional path signal

    • After successful join, the test trusts the node, verifies snapshot startup (startup_seqno != 0), then retires/stops it to keep the network clean.
    • Retains optional log inspection to detect malformed-ledger handling without making it a hard assertion.
network.setup_join_node(
    new_node,
    args.package,
    args,
    from_snapshot=True,
    snapshots_dir=snapshots_dir,
    fetch_recent_snapshot=False,
)

# Corrupt copied uncommitted chunk, then join should still succeed
network.run_join_node(new_node)
network.trust_node(new_node, args)
Original prompt

Modify the existing open PR branch copilot/add-testcase-for-issue-6612 for PR #7666 in microsoft/CCF.

Context:

  • CI job nodes_test in VMSS Virtual C failed with AssertionError('Node 13 with corrupted ledger unexpectedly started').
  • The added regression test is in tests/reconfiguration.py as test_add_node_with_corrupted_ledger.
  • Issue cchost crashes in case a corrupt ledger file is found on a node that is joining the network #6612 is about a joining node crashing when it has a corrupt/truncated uncommitted ledger file while starting from a later snapshot. The corrupt older/incomplete ledger should be ignored/truncated; the node should not crash.

Required change:

  1. Update test_add_node_with_corrupted_ledger so it matches issue cchost crashes in case a corrupt ledger file is found on a node that is joining the network #6612: the node should join successfully from a snapshot despite a corrupted/truncated ledger file in its copied ledger directory.
  2. The test should:
    • Create/copy a valid committed snapshot for the joiner.
    • Set up a joining node with that snapshot and copied ledger (from_snapshot=True, fetch_recent_snapshot=False where appropriate).
    • Corrupt/truncate an uncommitted ledger file in the joiner’s main ledger directory.
    • Start the node and assert it joins successfully, rather than asserting failure.
    • Trust/retire/stop the node to leave the network clean.
    • Optionally inspect logs to ensure the malformed ledger path was exercised.
  3. Remove the incorrect expectation that a corrupted copied ledger should make the join fail when the scenario is specifically join-from-snapshot with an older/incomplete corrupted ledger.
  4. Keep the change focused to the existing PR branch and do not open a new pull request.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The failing job is nodes_test in the VMSS Virtual C bucket, run by .github/workflows/ci.yml in the step:

The actionable failure is in test_add_node_with_corrupted_ledger in tests/reconfiguration.py, which ends with:

  • AssertionError('Node 13 with corrupted ledger unexpectedly started')

Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/tests/reconfiguration.py

What the logs show

The corrupted-ledger regression test expects a joiner to fail when replaying a deliberately truncated ledger, but the node started instead. The job log ends with:

  • RuntimeError: ["Failure in reconfiguration: AssertionError('Node 13 with corrupted ledger unexpectedly started')"]

This means the join path is no longer rejecting a corrupted copied ledger during startup.

Root cause

The regression test sets up a join node with:

  • from_snapshot=False
  • fetch_recent_snapshot=False

and truncates the latest uncommitted ledger chunk before network.run_join_node(...).

That should force the new node to replay the copied ledger and fail. Instead, the joiner gets far enough to appear healthy to the test harness.

Given the recent join/startup code, the most likely issue is that truncated ledger recovery is no longer treated as a fatal startup failure in the host/start path, so the node can enter the join loop without fully validating the copied ledger first.

The relevant startup/join logic is in:

  • src/node/node_state.h
    • join startup path and fatal join handling around initiate_join_unsafe
    • startup snapshot / ledger recovery setup

Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/src/node/node_state.h

There is also a raft assertion seen elsewhere in the same job:

  • src/consensus/aft/raft.h

Link: https://github.com/microsoft/CCF/blob/3f1a8c1bb7059e63032897fbaca8977a77dbb208/src/consensus/aft/raft.h

But that is not the failure that made the test fail. The failing assertion is specifically the corrupted-ledger join unexpectedly succeeding.


Solution

Make corrupted ledger replay a hard startup failure for joining nodes before they can be considered successfully started.

Recommended fix

In the host/node startup path, when ledger deserialisation detects a truncated or malformed ledger entry during recovery/replay, convert that exception into a fatal startup error and terminate the node process cleanly.

Concretely:

  1. Catch ledger parsing/deserialisation errors during startup ledger replay.
  2. Emit a fatal error message to the host.
  3. Stop startup rather than continuing into the join loop.

Code suggestion

The desired behavior should mirror the existing fatal handling already used for early join TLS/certificate failures in src/node/node_state.h around:

  • NodeState::initiate_join_unsafe
  • fatal error emissio...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Update regression test for issue #6612 in nodes_test Align corrupted-ledger join regression with snapshot-start behavior Jun 5, 2026
Copilot AI requested a review from achamayou June 5, 2026 13:10
@achamayou achamayou marked this pull request as ready for review June 5, 2026 13:17
@achamayou achamayou requested a review from a team as a code owner June 5, 2026 13:17
@achamayou achamayou merged commit dc0dbb6 into copilot/add-testcase-for-issue-6612 Jun 5, 2026
17 of 19 checks passed
@achamayou achamayou deleted the copilot/copilotadd-testcase-for-issue-6612 branch June 5, 2026 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants