fix: resolve HandshakeResend hive test failure#236
Conversation
Four related bugs caused Besu to be the only discv5 client failing the HandshakeResend hive test. Bug 1 (initiator): WhoAreYouPacketHandler embedded a pending request's message in the handshake packet but never updated its TaskStatus from AWAIT to SENT. NextTaskHandler would then re-send it as a duplicate ordinary packet one second after authentication. Bug 2 (responder): UnauthorizedMessagePacketHandler issued a fresh WHOAREYOU with a new nonce when a second ordinary packet arrived during WHOAREYOU_SENT state, invalidating the in-progress handshake. The fix resends the existing challenge instead, mirroring geth's handleUnknown. Bug 3 (initiator): The WHOAREYOU nonce check in WhoAreYouPacketHandler rejected any nonce that didn't match lastOutboundNonce, even when the responder legitimately resent an earlier challenge. The fix only rejects when there are no pending requests to match against. Bug 4 (session manager): NodeSessionManager.lastNonceToSession tracked only the most recent outbound nonce per session, evicting older ones on each new send. When multiple packets were sent before handshake, the resent WHOAREYOU's nonce was no longer in the map and WhoAreYouSession- Resolver discarded the packet as unexpected. The fix retains all nonces and removes them together when the session is deleted. Verified against the discv5 hive test suite: HandshakeResend now passes. The only remaining failure is PingMultiIP, which is an expected IPv6 limitation on Mac with Docker Desktop.
Two follow-up issues flagged on PR Consensys#236: 1. Stale pre-handshake nonces still routed delayed WHOAREYOU after AUTH. onSessionLastNonceUpdate accumulates every outbound nonce in lastNonceToSession with cleanup only on session delete. After a session reached AUTHENTICATED, a late-arriving WHOAREYOU echoing a pre-handshake nonce could bind to the live session and either cancel active requests or start a spurious re-handshake. Add clearOldNoncesForSession on the manager and call it from NodeSession.setState on every AUTHENTICATED transition, keeping only the current lastOutboundNonce. 2. resendFirstOutgoingWhoAreYou could resend a different challenge on repeated calls. pendingWhoAreYou is a LinkedHashMap with accessOrder=true, so resendOutgoingWhoAreYouFor's get(nonce) moved the eldest entry to the tail, causing a subsequent call to pick the next eldest. Read the entry via the values iterator (no access-order side effect) and inline the resend.
There was a problem hiding this comment.
Pull request overview
This PR aims to make Besu’s discv5 implementation interoperate with the HandshakeResend hive test by fixing handshake resend semantics across responder/initiator paths and improving nonce-to-session correlation during handshakes, with new unit/integration tests to lock in the behavior.
Changes:
- Responder: resend an in-flight WHOAREYOU when additional ordinary packets arrive mid-handshake, instead of issuing a fresh challenge.
- Initiator: mark the embedded request as
SENTwhen included in a handshake, and relax WHOAREYOU acceptance beyond justlastOutboundNonce. - Session tracking: retain multiple outbound nonces per session for WHOAREYOU session resolution, and add cleanup on authentication.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/test/java/org/ethereum/beacon/discovery/pipeline/handler/UnauthorizedMessagePacketHandlerTest.java | Updates the responder-side unit test to assert resend-of-existing-challenge behavior. |
| src/test/java/org/ethereum/beacon/discovery/HandshakeHandlersTest.java | Adds a unit test ensuring embedded requests are marked SENT when used in a handshake. |
| src/test/java/org/ethereum/beacon/discovery/DiscoveryManagerTest.java | Adds an integration test mirroring geth’s resend scenario with two pings before session establishment. |
| src/main/java/org/ethereum/beacon/discovery/schema/NodeSession.java | Adds resendFirstOutgoingWhoAreYou() and triggers nonce-map cleanup on AUTHENTICATED. |
| src/main/java/org/ethereum/beacon/discovery/pipeline/handler/WhoAreYouPacketHandler.java | Relaxes nonce validation and marks embedded request SENT after handshake emission. |
| src/main/java/org/ethereum/beacon/discovery/pipeline/handler/UnauthorizedMessagePacketHandler.java | Changes responder handling to resend an existing WHOAREYOU during WHOAREYOU_SENT state. |
| src/main/java/org/ethereum/beacon/discovery/pipeline/handler/NodeSessionManager.java | Keeps multiple outbound nonces per session and adds clearOldNoncesForSession(...). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replaces the AUTHENTICATED-cleanup approach added in the previous commit with a bounded recent-outbound-nonce window per session. Addresses two opposing Copilot review comments on PR Consensys#236: - Unbounded growth post-authentication. The earlier cleanup only fired on the AUTHENTICATED transition; every subsequent sendOutgoingOrdinary added a nonce to NodeSessionManager.lastNonceToSession that was never removed until session delete. - Loss of lost-handshake recovery. The cleanup kept only the most-recent outbound nonce (the handshake's own nonce). If the handshake packet dropped, the responder's resent WHOAREYOU referenced an evicted nonce and WhoAreYouSessionResolver could no longer bind it. New design: - NodeSession tracks the last 8 outbound nonces in a LinkedHashSet (MAX_RECENT_OUTBOUND_NONCES). On each new nonce the eldest is evicted from both the per-session set and lastNonceToSession at the same time. - WhoAreYouPacketHandler accepts a WHOAREYOU iff its nonce is in this window. Broader than "must equal lastOutboundNonce" (so concurrent ordinary packets work, and a responder's resent WHOAREYOU referencing an earlier packet is still bindable). Narrower than "any nonce ever sent" (no leak, stale post-auth challenges evict naturally). - On nonce miss, WhoAreYouPacketHandler now silently ignores the packet per spec §"WHOAREYOU is only ever valid as a response to a previously sent request" rather than cancelling pending requests, so a spurious or stale challenge cannot abort live work. - The hasPendingRequests relaxation introduced earlier is removed; the bounded window subsumes its role. Also strips the stale "Expected to FAIL" Javadoc on HandshakeHandlersTest.pendingRequestIsMarkedSentWhenEmbeddedInHandshake (third Copilot comment) — the test passes now.
Cursor's HIGH-severity follow-up on PR Consensys#236: after the bounded-window redesign in b0e6fe1, a delayed WHOAREYOU whose nonce is still in the 8-deep recent window would re-trigger a full handshake on an already authenticated session, overwriting session keys. Fix: gate WhoAreYouPacketHandler on BOTH "nonce is in the recent outbound window" AND "at least one request is pending". This is not the earlier hasPendingRequests relaxation we removed (that one OR'd the two as a fallback); the new check ANDs them as mandatory preconditions. Matches spec §"WHOAREYOU is only ever valid as a response to a previously sent request". Truth table: - first-time handshake / stale-session re-handshake / concurrent pings / lost-handshake recovery: recent=true + pending=true → proceed. - stale duplicate WHOAREYOU after auth (Cursor's case): recent=true but pending=false → ignore. Folded follow-up (same code area): swap requestIdStatuses from ConcurrentHashMap to Collections.synchronizedMap(new LinkedHashMap<>()). ConcurrentHashMap iterates in hash-bucket order, so getFirstAwait/SentRequestInfo() could embed a non-deterministic request in the handshake instead of the one the WHOAREYOU references (spec §189-201). Insertion order ≈ send order ≈ the request the responder challenged against. All call sites of requestIdStatuses are inside NodeSession.synchronized methods so the swap is thread-safe. Residual edge case (acknowledged, not fixed): stale WHOAREYOU + a new unrelated request pending still fires a wasted re-handshake. The responder converges on the new keys and the new request gets answered, so this is wasted crypto rather than a correctness break. A clean fix needs nonce→request tracking; deferred to a follow-up.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit fd751d6. Configure here.
Cursor's Medium-severity follow-up on PR Consensys#236: NodeSession.resetHandshakeState cleared the per-session recentOutboundNonces but left the matching entries in NodeSessionManager.lastNonceToSession. Those orphans linger until session delete (180 s), so per-session map size could exceed eight and stale nonces still resolved to the session — although the two-gate check added in fd751d6 already prevented any handshake on them, the leak and wasted pipeline work were real. Fix: route the bulk eviction through the manager via a new NodeSessionManager.removeNoncesForSession(session, nonces) method, the same way single-nonce eviction in generateNonce already goes through onSessionLastNonceUpdate. Uses Map.remove(key, value) for race-safe removal so a session replacement for the same nonce cannot evict the new mapping.
Two Copilot follow-ups on PR Consensys#236 commit 000c01e: 1. After fd751d6 swapped requestIdStatuses from ConcurrentHashMap to Collections.synchronizedMap(LinkedHashMap), the request-expiration callback in createNextRequest still called requestIdStatuses.remove(...) directly from the scheduler thread, without holding NodeSession's monitor. ConcurrentHashMap tolerated that; LinkedHashMap throws CME when another synchronized reader is iterating values().stream(). Fix: route the callback through the existing private synchronized clearRequestInfo(Bytes) helper, which acquires this monitor and serialises with iteration in getFirstAwait/SentRequestInfo() and cancelAllRequests(). 2. NodeSessionManager.onSessionLastNonceUpdate used key-only lastNonceToSession::remove for the evicted nonce; the companion removeNoncesForSession (added in 000c01e) uses value-conditional Map.remove(key, value). Align onSessionLastNonceUpdate to do the same — a nonce-collision or delete-then-evict race could otherwise drop a fresh mapping owned by a different session.
Two Copilot follow-ups on PR Consensys#236 commit 6a5b90b: 1. WhoAreYouPacketHandler did getFirstAwaitRequestInfo().or( getFirstSentRequestInfo).orElseThrow(...) after the hasPendingRequests gate. The gate and the lookup are separate synchronized calls on NodeSession, so the session monitor is released between them; a concurrent request-expiration callback (clearRequestInfo, also synchronized) can clear the last pending request in that window. orElseThrow then fell into the catch path and called cancelAllRequests — destructive side effect for a benign race. Fix: replace orElseThrow with isEmpty() + ignore-and-return, matching the gate-miss behaviour. 2. HandshakeHandlersTest.pendingRequestIsMarkedSentWhenEmbeddedInHandshake still carried a multi-line comment narrating the historical bug in past tense ("This assertion fails today..."). The test passes now; reword to a single line stating the invariant the assertion guards.
Copilot follow-up on PR Consensys#236 commit f7c0895: the embedded-request race-check we added still ran AFTER the two session.setInitiatorKey/ setRecipientKey calls. If the race fired and we returned early, the session's keys had already been overwritten with new HKDF output that the peer didn't know about — corrupting subsequent encrypted traffic on a session whose previous keys were still valid with the peer. Fix: move the two setInitiatorKey/setRecipientKey calls below the embeddedRequestOpt.isEmpty() early-return. The HKDF computation itself stays where it is (pure local computation; cheap to discard on the race path). The non-race path is unchanged — keys are still installed before the handshake is sent.
…ictedNonce Handshake packets use generateNonce() for their header nonce, but WHOAREYOU packets only ever reference *ordinary/random* packet nonces. Counting handshake nonces against the bounded window wastes a slot and, when full, evicts the oldest ordinary nonce — which may still be in-flight. If the handshake is then lost and the responder retransmits WHOAREYOU(Na), the resolver can no longer bind it to the session, causing a stall. Fix: add generateHandshakeNonce() that generates a fresh random nonce without touching recentOutboundNonces or lastNonceToSession; use it in WhoAreYouPacketHandler. Ordinary-packet nonces remain tracked as before. Also rename previousNonce → evictedNonce in onSessionLastNonceUpdate to match the Javadoc description of bounded-window eviction.
…-window tests - NodeSession.getRequestInfo: use Optional.ofNullable and reorder null check so a response arriving for an already-expired request returns Optional.empty() instead of NPE-ing in Optional.of(null) - NodeSession.getFirstPendingRequestInfo: new helper combining AWAIT+SENT search in one synchronized call; used in WhoAreYouPacketHandler to reduce the gate + post-HKDF fetch from four synchronized lock acquisitions to two - NodeSession.resetHandshakeState: pass Set.copyOf(recentOutboundNonces) to the manager so the manager iterates a stable snapshot; the live set is cleared immediately after and would otherwise be empty by the time the manager reads it - NodeSessionTest: five new unit tests covering the recentOutboundNonces window (add, evict, manager notification, handshake-nonce exclusion, reset cleanup)
| if (session.getState() == SessionState.WHOAREYOU_SENT) { | ||
| session.resendFirstOutgoingWhoAreYou(); | ||
| return; | ||
| } |
There was a problem hiding this comment.
The race window is real at the Java level — sendOutgoingWhoAreYou and setState(WHOAREYOU_SENT) are two separate synchronized calls with no atomicity guarantee across them. However, it isn't exploitable in practice: NettyDiscoveryServerImpl creates a NioEventLoopGroup(1) (single IO thread), and the Reactor subscription in DiscoveryManagerImpl.start() has no publishOn/subscribeOn, so every incoming-packet handler runs synchronously on that one Netty thread. A second UDP datagram isn't dequeued until the first has finished the entire handler chain, making the interleaving impossible.
We considered switching to !pendingWhoAreYou.isEmpty() as suggested, but it introduces a subtler dependency — correctness then relies on pendingWhoAreYou being reliably cleared in every exit path from WHOAREYOU_SENT (authentication and timeout reset), which is true today but is implicit coupling. The state flag is the authoritative signal for "are we mid-handshake?" and is easier to reason about.
If the gap ever becomes a real concern (e.g. the pipeline is parallelised), the right fix would be to make send + setState atomic inside NodeSession.sendOutgoingWhoAreYou rather than changing what the handler checks against. Leaving as-is for now.

Summary
Besu was the only discv5 client failing the HandshakeResend hive test. Root cause turned out to be four interacting bugs across initiator and responder paths; review feedback from cursor-bot and Copilot tightened the nonce-tracking design into a single bounded window.
Responder side
UnauthorizedMessagePacketHandler. InWHOAREYOU_SENTwe used to issue a fresh challenge with a new nonce, invalidating any handshake the initiator had already signed against the first. Now resends the in-flight challenge (mirrors gethhandleUnknown, v5_udp.go:831-853). The spec section §"Processing of message packets while a challenge is received" is ambiguous on this point; every interoperable client takes this interpretation.NodeSession.resendFirstOutgoingWhoAreYou()— new helper. Reads the eldest entry via the values iterator rather thanget(nonce)becausependingWhoAreYouis aLinkedHashMapwithaccessOrder=trueandgetreorders to the tail; without this, repeated resends would cycle through different entries when multiple challenges are pending.Initiator side
WhoAreYouPacketHandler. Previously we embedded a pending request's message in the handshake but left itsTaskStatusatAWAIT, soNextTaskHandlerre-sent it as a duplicate ordinary packet ~1 s after authentication. The duplicate corrupted the freshly-authenticated session.WhoAreYouPacketHandler. Mandatory AND, not the earlier OR-relaxation. Per spec §"WHOAREYOU is only ever valid as a response to a previously sent request" the pending-request precondition is required; without it, a delayed/replayed WHOAREYOU whose nonce is still in the 8-deep recent window could overwrite live session keys on an established session.NodeSessionnow backsrequestIdStatuseswithCollections.synchronizedMap(new LinkedHashMap<>())instead ofConcurrentHashMap.ConcurrentHashMapiterates in hash-bucket order, so the previous code could embed a non-deterministic request; insertion order ≈ send order ≈ the request the WHOAREYOU's nonce references (spec §189-201). Synchronization is provided byNodeSession's synchronized methods.Session manager — bounded recent-outbound-nonce window per session
WhoAreYouSessionResolverneeds to bind a WHOAREYOU back to the initiator session by the WHOAREYOU's nonce, which mirrors the nonce of an ordinary packet the initiator sent. The original code tracked only the most recent nonce, which broke when several packets were sent before the handshake completed (the responder's WHOAREYOU references the first packet's nonce, not the latest). Naively retaining every nonce instead leaked memory and let stale post-auth WHOAREYOUs re-trigger handshakes.The PR converges on a single design:
NodeSessionkeeps the last 8 outbound nonces in aLinkedHashSet(MAX_RECENT_OUTBOUND_NONCES). On each new nonce the eldest is evicted from both the per-session set andNodeSessionManager.lastNonceToSessiontogether — so the global map stays bounded atnum_sessions × 8.WhoAreYouPacketHandleraccepts a WHOAREYOU iff its nonce is in this window AND at least one request is pending. See the Initiator-side bullet above for why both gates are required.generateNonceand bulk eviction onresetHandshakeState(request-timeout path) both call intoNodeSessionManagersolastNonceToSessionstays consistent with the per-session bounded window — no orphaned mappings linger between handshake-timeout cycles and session delete.removeNoncesForSessionuses value-conditionalMap.remove(k, v)to be race-safe against a session replacement.removeIf(s -> s == removedSession)clears any remaining mappings.Test plan
HandshakeHandlersTest.pendingRequestIsMarkedSentWhenEmbeddedInHandshake— unit test for the AWAIT → SENT fix.DiscoveryManagerTest.callResendAfterHandshake— integration test mirroring geth'sTestUDPv5_callResend; covers the embedded-request-SENT fix and the recent-nonce-window check.DiscoveryManagerTest.testDoubleUnathorizedPing— integration test for two ordinary packets racing duringWHOAREYOU_SENT; covers responder-resend and the multi-nonce session lookup.UnauthorizedMessagePacketHandlerTest.shouldResendFirstExistingWhoAreYouWhenIncomingNonceIsUnknownDuringHandshake— unit test for the responder-resend fix.Verification:
./gradlew buildpasses (full suite).HandshakeResendnow passes. OnlyPingMultiIPstill fails, which is an expected IPv6 limitation when running hive against Docker Desktop on macOS.Closes #234