Skip to content

fix: resolve HandshakeResend hive test failure#236

Merged
usmansaleem merged 10 commits into
Consensys:masterfrom
usmansaleem:fix/handshake-resend
Jun 9, 2026
Merged

fix: resolve HandshakeResend hive test failure#236
usmansaleem merged 10 commits into
Consensys:masterfrom
usmansaleem:fix/handshake-resend

Conversation

@usmansaleem

@usmansaleem usmansaleem commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Besu was the only discv5 client failing the HandshakeResend hive test. Root cause turned out to be four interacting bugs across initiator and responder paths; review feedback from cursor-bot and Copilot tightened the nonce-tracking design into a single bounded window.

Responder side

  • Resend the existing WHOAREYOU when a second ordinary packet arrives mid-handshakeUnauthorizedMessagePacketHandler. In WHOAREYOU_SENT we used to issue a fresh challenge with a new nonce, invalidating any handshake the initiator had already signed against the first. Now resends the in-flight challenge (mirrors geth handleUnknown, v5_udp.go:831-853). The spec section §"Processing of message packets while a challenge is received" is ambiguous on this point; every interoperable client takes this interpretation.
  • NodeSession.resendFirstOutgoingWhoAreYou() — new helper. Reads the eldest entry via the values iterator rather than get(nonce) because pendingWhoAreYou is a LinkedHashMap with accessOrder=true and get reorders to the tail; without this, repeated resends would cycle through different entries when multiple challenges are pending.

Initiator side

  • Mark the request embedded in the handshake as SENTWhoAreYouPacketHandler. Previously we embedded a pending request's message in the handshake but left its TaskStatus at AWAIT, so NextTaskHandler re-sent it as a duplicate ordinary packet ~1 s after authentication. The duplicate corrupted the freshly-authenticated session.
  • Accept WHOAREYOU iff its nonce is in our recent outbound window AND a request is pendingWhoAreYouPacketHandler. Mandatory AND, not the earlier OR-relaxation. Per spec §"WHOAREYOU is only ever valid as a response to a previously sent request" the pending-request precondition is required; without it, a delayed/replayed WHOAREYOU whose nonce is still in the 8-deep recent window could overwrite live session keys on an established session.
  • Embed the earliest-created pending request in the handshakeNodeSession now backs requestIdStatuses with Collections.synchronizedMap(new LinkedHashMap<>()) instead of ConcurrentHashMap. ConcurrentHashMap iterates in hash-bucket order, so the previous code could embed a non-deterministic request; insertion order ≈ send order ≈ the request the WHOAREYOU's nonce references (spec §189-201). Synchronization is provided by NodeSession's synchronized methods.

Session manager — bounded recent-outbound-nonce window per session

WhoAreYouSessionResolver needs to bind a WHOAREYOU back to the initiator session by the WHOAREYOU's nonce, which mirrors the nonce of an ordinary packet the initiator sent. The original code tracked only the most recent nonce, which broke when several packets were sent before the handshake completed (the responder's WHOAREYOU references the first packet's nonce, not the latest). Naively retaining every nonce instead leaked memory and let stale post-auth WHOAREYOUs re-trigger handshakes.

The PR converges on a single design:

  • NodeSession keeps the last 8 outbound nonces in a LinkedHashSet (MAX_RECENT_OUTBOUND_NONCES). On each new nonce the eldest is evicted from both the per-session set and NodeSessionManager.lastNonceToSession together — so the global map stays bounded at num_sessions × 8.
  • WhoAreYouPacketHandler accepts a WHOAREYOU iff its nonce is in this window AND at least one request is pending. See the Initiator-side bullet above for why both gates are required.
  • On either-gate miss the handler silently ignores the packet per spec — it does not cancel pending requests, so a spurious or stale challenge cannot abort live work.
  • All eviction paths go through the manager. Single-nonce eviction on each generateNonce and bulk eviction on resetHandshakeState (request-timeout path) both call into NodeSessionManager so lastNonceToSession stays consistent with the per-session bounded window — no orphaned mappings linger between handshake-timeout cycles and session delete. removeNoncesForSession uses value-conditional Map.remove(k, v) to be race-safe against a session replacement.
  • On session delete, removeIf(s -> s == removedSession) clears any remaining mappings.

Test plan

  • HandshakeHandlersTest.pendingRequestIsMarkedSentWhenEmbeddedInHandshake — unit test for the AWAIT → SENT fix.
  • DiscoveryManagerTest.callResendAfterHandshake — integration test mirroring geth's TestUDPv5_callResend; covers the embedded-request-SENT fix and the recent-nonce-window check.
  • DiscoveryManagerTest.testDoubleUnathorizedPing — integration test for two ordinary packets racing during WHOAREYOU_SENT; covers responder-resend and the multi-nonce session lookup.
  • UnauthorizedMessagePacketHandlerTest.shouldResendFirstExistingWhoAreYouWhenIncomingNonceIsUnknownDuringHandshake — unit test for the responder-resend fix.

Verification:

  • ./gradlew build passes (full suite).
  • discv5 hive: HandshakeResend now passes. Only PingMultiIP still fails, which is an expected IPv6 limitation when running hive against Docker Desktop on macOS.

Closes #234

Four related bugs caused Besu to be the only discv5 client failing the
HandshakeResend hive test.

Bug 1 (initiator): WhoAreYouPacketHandler embedded a pending request's
message in the handshake packet but never updated its TaskStatus from
AWAIT to SENT. NextTaskHandler would then re-send it as a duplicate
ordinary packet one second after authentication.

Bug 2 (responder): UnauthorizedMessagePacketHandler issued a fresh
WHOAREYOU with a new nonce when a second ordinary packet arrived during
WHOAREYOU_SENT state, invalidating the in-progress handshake. The fix
resends the existing challenge instead, mirroring geth's handleUnknown.

Bug 3 (initiator): The WHOAREYOU nonce check in WhoAreYouPacketHandler
rejected any nonce that didn't match lastOutboundNonce, even when the
responder legitimately resent an earlier challenge. The fix only rejects
when there are no pending requests to match against.

Bug 4 (session manager): NodeSessionManager.lastNonceToSession tracked
only the most recent outbound nonce per session, evicting older ones on
each new send. When multiple packets were sent before handshake, the
resent WHOAREYOU's nonce was no longer in the map and WhoAreYouSession-
Resolver discarded the packet as unexpected. The fix retains all nonces
and removes them together when the session is deleted.

Verified against the discv5 hive test suite: HandshakeResend now passes.
The only remaining failure is PingMultiIP, which is an expected IPv6
limitation on Mac with Docker Desktop.
Comment thread src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java
Two follow-up issues flagged on PR Consensys#236:

1. Stale pre-handshake nonces still routed delayed WHOAREYOU after AUTH.
   onSessionLastNonceUpdate accumulates every outbound nonce in
   lastNonceToSession with cleanup only on session delete. After a session
   reached AUTHENTICATED, a late-arriving WHOAREYOU echoing a pre-handshake
   nonce could bind to the live session and either cancel active requests
   or start a spurious re-handshake. Add clearOldNoncesForSession on the
   manager and call it from NodeSession.setState on every AUTHENTICATED
   transition, keeping only the current lastOutboundNonce.

2. resendFirstOutgoingWhoAreYou could resend a different challenge on
   repeated calls. pendingWhoAreYou is a LinkedHashMap with
   accessOrder=true, so resendOutgoingWhoAreYouFor's get(nonce) moved
   the eldest entry to the tail, causing a subsequent call to pick the
   next eldest. Read the entry via the values iterator (no access-order
   side effect) and inline the resend.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make Besu’s discv5 implementation interoperate with the HandshakeResend hive test by fixing handshake resend semantics across responder/initiator paths and improving nonce-to-session correlation during handshakes, with new unit/integration tests to lock in the behavior.

Changes:

  • Responder: resend an in-flight WHOAREYOU when additional ordinary packets arrive mid-handshake, instead of issuing a fresh challenge.
  • Initiator: mark the embedded request as SENT when included in a handshake, and relax WHOAREYOU acceptance beyond just lastOutboundNonce.
  • Session tracking: retain multiple outbound nonces per session for WHOAREYOU session resolution, and add cleanup on authentication.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/test/java/org/ethereum/beacon/discovery/pipeline/handler/UnauthorizedMessagePacketHandlerTest.java Updates the responder-side unit test to assert resend-of-existing-challenge behavior.
src/test/java/org/ethereum/beacon/discovery/HandshakeHandlersTest.java Adds a unit test ensuring embedded requests are marked SENT when used in a handshake.
src/test/java/org/ethereum/beacon/discovery/DiscoveryManagerTest.java Adds an integration test mirroring geth’s resend scenario with two pings before session establishment.
src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java Adds resendFirstOutgoingWhoAreYou() and triggers nonce-map cleanup on AUTHENTICATED.
src/main/java/org/ethereum/beacon/discovery/pipeline/handler/WhoAreYouPacketHandler.java Relaxes nonce validation and marks embedded request SENT after handshake emission.
src/main/java/org/ethereum/beacon/discovery/pipeline/handler/UnauthorizedMessagePacketHandler.java Changes responder handling to resend an existing WHOAREYOU during WHOAREYOU_SENT state.
src/main/java/org/ethereum/beacon/discovery/pipeline/handler/NodeSessionManager.java Keeps multiple outbound nonces per session and adds clearOldNoncesForSession(...).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java Outdated
Replaces the AUTHENTICATED-cleanup approach added in the previous commit
with a bounded recent-outbound-nonce window per session. Addresses two
opposing Copilot review comments on PR Consensys#236:

- Unbounded growth post-authentication. The earlier cleanup only fired
  on the AUTHENTICATED transition; every subsequent sendOutgoingOrdinary
  added a nonce to NodeSessionManager.lastNonceToSession that was never
  removed until session delete.
- Loss of lost-handshake recovery. The cleanup kept only the most-recent
  outbound nonce (the handshake's own nonce). If the handshake packet
  dropped, the responder's resent WHOAREYOU referenced an evicted nonce
  and WhoAreYouSessionResolver could no longer bind it.

New design:
- NodeSession tracks the last 8 outbound nonces in a LinkedHashSet
  (MAX_RECENT_OUTBOUND_NONCES). On each new nonce the eldest is evicted
  from both the per-session set and lastNonceToSession at the same time.
- WhoAreYouPacketHandler accepts a WHOAREYOU iff its nonce is in this
  window. Broader than "must equal lastOutboundNonce" (so concurrent
  ordinary packets work, and a responder's resent WHOAREYOU referencing
  an earlier packet is still bindable). Narrower than "any nonce ever
  sent" (no leak, stale post-auth challenges evict naturally).
- On nonce miss, WhoAreYouPacketHandler now silently ignores the packet
  per spec §"WHOAREYOU is only ever valid as a response to a previously
  sent request" rather than cancelling pending requests, so a spurious
  or stale challenge cannot abort live work.
- The hasPendingRequests relaxation introduced earlier is removed; the
  bounded window subsumes its role.

Also strips the stale "Expected to FAIL" Javadoc on
HandshakeHandlersTest.pendingRequestIsMarkedSentWhenEmbeddedInHandshake
(third Copilot comment) — the test passes now.
Cursor's HIGH-severity follow-up on PR Consensys#236: after the bounded-window
redesign in b0e6fe1, a delayed WHOAREYOU whose nonce is still in the
8-deep recent window would re-trigger a full handshake on an already
authenticated session, overwriting session keys.

Fix: gate WhoAreYouPacketHandler on BOTH "nonce is in the recent
outbound window" AND "at least one request is pending". This is not
the earlier hasPendingRequests relaxation we removed (that one OR'd
the two as a fallback); the new check ANDs them as mandatory
preconditions. Matches spec §"WHOAREYOU is only ever valid as a
response to a previously sent request".

Truth table:
- first-time handshake / stale-session re-handshake / concurrent pings
  / lost-handshake recovery: recent=true + pending=true → proceed.
- stale duplicate WHOAREYOU after auth (Cursor's case): recent=true
  but pending=false → ignore.

Folded follow-up (same code area): swap requestIdStatuses from
ConcurrentHashMap to Collections.synchronizedMap(new LinkedHashMap<>()).
ConcurrentHashMap iterates in hash-bucket order, so
getFirstAwait/SentRequestInfo() could embed a non-deterministic
request in the handshake instead of the one the WHOAREYOU references
(spec §189-201). Insertion order ≈ send order ≈ the request the
responder challenged against. All call sites of requestIdStatuses are
inside NodeSession.synchronized methods so the swap is thread-safe.

Residual edge case (acknowledged, not fixed): stale WHOAREYOU + a new
unrelated request pending still fires a wasted re-handshake. The
responder converges on the new keys and the new request gets
answered, so this is wasted crypto rather than a correctness break.
A clean fix needs nonce→request tracking; deferred to a follow-up.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit fd751d6. Configure here.

Comment thread src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java Outdated
Cursor's Medium-severity follow-up on PR Consensys#236: NodeSession.resetHandshakeState
cleared the per-session recentOutboundNonces but left the matching entries in
NodeSessionManager.lastNonceToSession. Those orphans linger until session delete
(180 s), so per-session map size could exceed eight and stale nonces still
resolved to the session — although the two-gate check added in fd751d6 already
prevented any handshake on them, the leak and wasted pipeline work were real.

Fix: route the bulk eviction through the manager via a new
NodeSessionManager.removeNoncesForSession(session, nonces) method, the same way
single-nonce eviction in generateNonce already goes through
onSessionLastNonceUpdate. Uses Map.remove(key, value) for race-safe removal so
a session replacement for the same nonce cannot evict the new mapping.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java
Two Copilot follow-ups on PR Consensys#236 commit 000c01e:

1. After fd751d6 swapped requestIdStatuses from ConcurrentHashMap to
   Collections.synchronizedMap(LinkedHashMap), the request-expiration
   callback in createNextRequest still called requestIdStatuses.remove(...)
   directly from the scheduler thread, without holding NodeSession's
   monitor. ConcurrentHashMap tolerated that; LinkedHashMap throws CME
   when another synchronized reader is iterating values().stream().
   Fix: route the callback through the existing private synchronized
   clearRequestInfo(Bytes) helper, which acquires this monitor and
   serialises with iteration in getFirstAwait/SentRequestInfo() and
   cancelAllRequests().

2. NodeSessionManager.onSessionLastNonceUpdate used key-only
   lastNonceToSession::remove for the evicted nonce; the companion
   removeNoncesForSession (added in 000c01e) uses value-conditional
   Map.remove(key, value). Align onSessionLastNonceUpdate to do the
   same — a nonce-collision or delete-then-evict race could otherwise
   drop a fresh mapping owned by a different session.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/test/java/org/ethereum/beacon/discovery/HandshakeHandlersTest.java Outdated
Two Copilot follow-ups on PR Consensys#236 commit 6a5b90b:

1. WhoAreYouPacketHandler did getFirstAwaitRequestInfo().or(
   getFirstSentRequestInfo).orElseThrow(...) after the hasPendingRequests
   gate. The gate and the lookup are separate synchronized calls on
   NodeSession, so the session monitor is released between them; a
   concurrent request-expiration callback (clearRequestInfo, also
   synchronized) can clear the last pending request in that window.
   orElseThrow then fell into the catch path and called
   cancelAllRequests — destructive side effect for a benign race.
   Fix: replace orElseThrow with isEmpty() + ignore-and-return,
   matching the gate-miss behaviour.

2. HandshakeHandlersTest.pendingRequestIsMarkedSentWhenEmbeddedInHandshake
   still carried a multi-line comment narrating the historical bug in
   past tense ("This assertion fails today..."). The test passes now;
   reword to a single line stating the invariant the assertion guards.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Copilot follow-up on PR Consensys#236 commit f7c0895: the embedded-request
race-check we added still ran AFTER the two session.setInitiatorKey/
setRecipientKey calls. If the race fired and we returned early, the
session's keys had already been overwritten with new HKDF output that
the peer didn't know about — corrupting subsequent encrypted traffic
on a session whose previous keys were still valid with the peer.

Fix: move the two setInitiatorKey/setRecipientKey calls below the
embeddedRequestOpt.isEmpty() early-return. The HKDF computation
itself stays where it is (pure local computation; cheap to discard
on the race path). The non-race path is unchanged — keys are still
installed before the handshake is sent.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java
…ictedNonce

Handshake packets use generateNonce() for their header nonce, but WHOAREYOU
packets only ever reference *ordinary/random* packet nonces. Counting handshake
nonces against the bounded window wastes a slot and, when full, evicts the
oldest ordinary nonce — which may still be in-flight. If the handshake is then
lost and the responder retransmits WHOAREYOU(Na), the resolver can no longer
bind it to the session, causing a stall.

Fix: add generateHandshakeNonce() that generates a fresh random nonce without
touching recentOutboundNonces or lastNonceToSession; use it in
WhoAreYouPacketHandler. Ordinary-packet nonces remain tracked as before.

Also rename previousNonce → evictedNonce in onSessionLastNonceUpdate to
match the Javadoc description of bounded-window eviction.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@usmansaleem usmansaleem requested a review from lucassaldanha June 4, 2026 05:38
…-window tests

- NodeSession.getRequestInfo: use Optional.ofNullable and reorder null check so a
  response arriving for an already-expired request returns Optional.empty() instead
  of NPE-ing in Optional.of(null)
- NodeSession.getFirstPendingRequestInfo: new helper combining AWAIT+SENT search in
  one synchronized call; used in WhoAreYouPacketHandler to reduce the gate + post-HKDF
  fetch from four synchronized lock acquisitions to two
- NodeSession.resetHandshakeState: pass Set.copyOf(recentOutboundNonces) to the
  manager so the manager iterates a stable snapshot; the live set is cleared immediately
  after and would otherwise be empty by the time the manager reads it
- NodeSessionTest: five new unit tests covering the recentOutboundNonces window
  (add, evict, manager notification, handshake-nonce exclusion, reset cleanup)

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment on lines +58 to +61
if (session.getState() == SessionState.WHOAREYOU_SENT) {
session.resendFirstOutgoingWhoAreYou();
return;
}

@usmansaleem usmansaleem Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race window is real at the Java level — sendOutgoingWhoAreYou and setState(WHOAREYOU_SENT) are two separate synchronized calls with no atomicity guarantee across them. However, it isn't exploitable in practice: NettyDiscoveryServerImpl creates a NioEventLoopGroup(1) (single IO thread), and the Reactor subscription in DiscoveryManagerImpl.start() has no publishOn/subscribeOn, so every incoming-packet handler runs synchronously on that one Netty thread. A second UDP datagram isn't dequeued until the first has finished the entire handler chain, making the interleaving impossible.

We considered switching to !pendingWhoAreYou.isEmpty() as suggested, but it introduces a subtler dependency — correctness then relies on pendingWhoAreYou being reliably cleared in every exit path from WHOAREYOU_SENT (authentication and timeout reset), which is true today but is implicit coupling. The state flag is the authoritative signal for "are we mid-handshake?" and is easier to reason about.

If the gap ever becomes a real concern (e.g. the pipeline is parallelised), the right fix would be to make send + setState atomic inside NodeSession.sendOutgoingWhoAreYou rather than changing what the handler checks against. Leaving as-is for now.

@usmansaleem usmansaleem merged commit 8191fff into Consensys:master Jun 9, 2026
5 checks passed
@usmansaleem usmansaleem deleted the fix/handshake-resend branch June 9, 2026 02:53
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 9, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants