fix: wire NAT traversal into connection path with unified accept loop#25
fix: wire NAT traversal into connection path with unified accept loop#25jacderida wants to merge 1 commit into
Conversation
ee6cf6f to
688c9ca
Compare
mickvandijke
left a comment
There was a problem hiding this comment.
Deep Review: PR #25 — Wire NAT Traversal into Connection Path
+1070 / -212 across 13 files. The architectural direction is sound and the testnet validation is encouraging, but several issues need addressing before merge.
Note on Greptile's
DefaultHasherclaim: Greptile rated this 2/5 primarily becauseDefaultHasheris "non-deterministic across processes." That specific claim is a false positive —DefaultHasher::new()uses SipHash with fixed keys and IS deterministic today. The relay path works. However, it's fragile (see Medium #3 below).
CRITICAL
1. send_ack_timeout increase from 1s → 30s masks connection failures
Files: src/config/nat_timeouts.rs:141,144
DEFAULT_SEND_ACK_TIMEOUT went from 1s to 30s (30× increase), FAST_SEND_ACK_TIMEOUT from 500ms to 5s (10×). While the reasoning for large NAT chunk transfers is sound, this is a global default that affects all connections — not just NAT-traversed ones. Small control messages will now wait 30s before detecting a dead connection. The doc on TimeoutConfig::send_ack_timeout says "this must be shorter than any outer send timeout" — 30s likely exceeds many callers' timeouts, causing cascading failures.
Fix: Make the timeout adaptive (base + per-byte rate), or expose separate timeouts for control vs. bulk transfers, or scope the increase to NAT-traversed connections only.
2. Stale Connection Reaper Permanently Disabled — Unbounded DashMap Growth
Files: src/nat_traversal_api.rs:3669-3676, src/p2p_endpoint.rs:2733-2740
is_connected() was changed from checking conn.is_alive() (with dead-connection removal) to a bare self.connections.contains_key(addr). Meanwhile, poll_closed_connections() no longer removes dead connections from the DashMap. The stale connection reaper in P2pEndpoint filters on !inner.is_connected(addr) — which now always returns empty.
Impact: Dead connections accumulate forever. broadcast_address_to_peers() sends to dead connections. check_connections_for_observed_addresses scans dead entries every 500ms. The docstring still claims "removes it from the connection table and returns false" — stale and misleading.
Fix: Restore is_alive() in is_connected(), or have poll_closed_connections remove entries after emitting the event (with a grace period for hole-punched connections).
3. event_rx Has Competing Consumers — Events Silently Lost
Files: src/nat_traversal_api.rs:3579-3594, 4851-4855
Three independent code paths consume from the same mpsc::UnboundedReceiver:
spawn_accept_loop— drainsConnectionEstablishedeventsdrain_pending_events(called frompoll()) — drains all eventsaccept_connection(old method, still called fromconnection_router.rs:1299)
Events consumed by one reader are lost to the others. Additionally, spawn_accept_loop uses while let Ok(NatTraversalEvent::ConnectionEstablished { .. }) = erx.try_recv() — a refutable pattern that silently drops any non-ConnectionEstablished event it dequeues.
Fix: Use a broadcast channel, have a single drain point that routes events, or remove the old accept_connection() path and update ConnectionRouter.
HIGH
4. wire_id_from_addr Uses DefaultHasher — Fragile Wire Protocol
Files: src/endpoint.rs:352-362, src/nat_traversal_api.rs:2152-2164
DefaultHasher::new() is deterministic today (fixed SipHash keys), but Rust does not guarantee stability across compiler versions. A mixed-version deployment would silently break relay. Additionally:
- Two identical copies must stay in sync (only a doc comment enforces this)
- 32-byte output has only 64 bits of entropy (same u64 repeated 4×)
Fix: Extract into a shared function. Replace with BLAKE3 (already a dependency): blake3::hash(addr.to_string().as_bytes()).
MEDIUM
5. try_send in Reader Task Silently Drops Data
File: src/p2p_endpoint.rs:2475-2482
The switch from send().await to try_send means when the bounded data channel is full, received bytes are logged at warn! and discarded. QUIC guarantees reliable delivery at transport layer — silently dropping application messages breaks that contract. For 14MB chunk transfers over NAT, bursts can easily saturate the channel.
Fix: Spawn a short-lived task per message that blocks on send().await (bounded by a per-peer semaphore), increase channel capacity, or use tokio::select! with a timeout.
6. cached_remote_addr Stale After Connection Migration
Files: src/high_level/connection.rs:667-668, 1124-1126
remote_address() now returns a cached value set once at construction. After migration, DashMap lookups, data routing, and relay lookups all use a stale address.
Fix: Rename to initial_remote_address() and document, or update the cache on migration events.
7. Fire-and-Forget Fallback Creates Orphaned Connections
File: src/high_level/endpoint.rs:733-758
When hole_punch_tx is None, the fallback calls self.inner.connect(...) but discards both _ch and _conn. No driver is spawned, no cleanup exists — connection state leaks in the low-level Endpoint forever.
Fix: Properly register the connection, or don't create a full QUIC connection in the fallback path.
8. decode_rfc Corrupts Stream on Partial target_peer_id
File: src/frame/nat_traversal_unified.rs:369-380
When has_peer_id == 1 but r.remaining() < 32, the 1-byte flag is already consumed, leaving the stream position off by 1. Subsequent frame parsing will be corrupted. Non-zero non-one values are silently accepted.
Fix: Return Err(UnexpectedEnd) when fewer than 32 bytes remain. Validate has_peer_id is 0 or 1.
9. observed_address() Silently Returns None Under Lock Contention
File: src/high_level/connection.rs:681-688
try_lock returns None during contention, indistinguishable from "no OBSERVED_ADDRESS received." Address discovery may be delayed or miss observations entirely.
Fix: Cache observed address in an AtomicCell outside the connection mutex, or return a tri-state to distinguish contention from absence.
10. Debug error! Logs Left in Production Code
File: src/p2p_endpoint.rs:2812,2819
Two tracing::error!("FORWARDER_DEBUG: ...") calls at error level will trigger production alerts. Trivial fix — change to debug!.
211e396 to
758f16c
Compare
|
Thanks for the thorough review @mickvandijke. All 10 issues addressed in the latest force-push: CRITICAL1. 2. Stale connection reaper — Fixed. 3. HIGH4. MEDIUM5. 6. 7. Fire-and-forget fallback — Added comment explaining the intentional discard is for backward compatibility when 8. 9. 10. Debug |
82d872f to
56fdb2b
Compare
Wire the existing NAT traversal protocol (PUNCH_ME_NOW coordination, hole-punching) into the actual connection path so nodes behind NAT can participate in the network. Key changes: - Unified accept path: remove competing accept_connections background task, add accept_connection_direct() as sole accept path for both Quinn incoming and outgoing hole-punch connections - Tracked hole-punch connections: forward addresses from Quinn driver via channel to NatTraversalEndpoint for full connection registration (DashMap, events, reader tasks) instead of fire-and-forget - Background accept loop with parallel handshakes to prevent serialized blocking on slow NAT connections - Stale coordination reset so repeated hole-punches work across client sessions - Dial deduplication to prevent concurrent hole-punch attempts to the same target from deadlocking the runtime - Contention reduction: cached remote_address, try_lock for observed_address, try_send in reader tasks, removed RwLock writes from send/recv hot paths - Increased send_ack_timeout (1s to 30s) for large chunk transfers over NAT-traversed connections Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
56fdb2b to
ebe824b
Compare
mickvandijke
left a comment
There was a problem hiding this comment.
Deep Review — Critical & High Issues
Greptile Triage
Greptile's 4 findings (DefaultHasher non-determinism, stale reaper disabled, error! logs, try_send data loss) were all addressed before the PR was pushed. The Greptile review appears to have run against an earlier commit. The wire_id_from_addr in shared.rs:261 now uses deterministic byte packing, is_connected() checks close_reason(), and the debug logs use debug! level.
CRITICAL
C1. decode_auto fallback corrupts buffer after partial RFC decode
src/frame/nat_traversal_unified.rs:393-401
decode_auto tries decode_rfc first. If decode_rfc fails partway through (after consuming round + seq + address bytes), the Buf cursor has already advanced. decode_legacy then starts from the wrong position, interpreting leftover bytes as a new frame.
A malicious peer can craft a frame with a valid RFC prefix but has_peer_id = 2 (line 381) to trigger this path, causing decode_legacy to parse attacker-controlled data as a PunchMeNow frame. This is exploitable in a P2P protocol where any peer can send arbitrary frames.
Fix: Either save/restore the buffer position before attempting RFC decode, or don't fall back to legacy after partial RFC consumption.
C2. spawn_incoming_connection_forwarder registers connections without reader tasks
src/p2p_endpoint.rs:2857-2878
Registers peers in connected_peers (line 2873) and emits PeerConnected (line 2874) but never spawns a reader task. The forwarder only receives a SocketAddr from the channel — it has no Connection handle. Consequences:
recv()never delivers data for these connections- Event-driven cleanup (reader-exit handler) never fires
- Only the stale reaper can clean up, on a 10s interval
This same class of issue affects two more code paths:
try_hole_punch(line 1858-1869) — registers inconnected_peerswithout reader tasksend()lazy registration (lines ~2100-2133) — same problem
All three paths create "zombie" entries that appear connected but silently drop inbound data.
C3. Competing accept() consumers — race condition
src/nat_traversal_api.rs — lines 1542 vs 2716
Both spawn_accept_loop() (spawned unconditionally in new()) and accept_connections() (spawnable via start_listening()) call endpoint.accept() on the same InnerEndpoint. If both run, incoming connections are non-deterministically split between two code paths with different registration logic. The shared emitted_established_events DashSet further complicates this — if one path inserts an address first, the other path's event emission is suppressed.
Fix: Either remove accept_connections/start_listening entirely (since spawn_accept_loop replaces it), or guard against double-spawning with an AtomicBool.
HIGH
H1. Dial dedup key only uses first non-None address
src/p2p_endpoint.rs:1326
let target = target_ipv4.or(target_ipv6);If caller A dials (Some(ipv4), None) and caller B dials (None, Some(ipv6)) to the same peer, they won't be deduplicated — both start parallel hole-punch sessions, which is exactly what dedup was designed to prevent.
H2. Thundering-herd retry after failed primary dial
src/p2p_endpoint.rs:1346-1349, 1366
When the primary dial fails, the pending_dials entry is removed (line 1366) and all waiters receive the error. They all fall through to retry simultaneously with no guard — re-creating the concurrent dial storm that dedup was meant to prevent.
Fix: Either re-insert the pending_dials entry so only one waiter retries, or add exponential backoff with jitter for waiters.
H3. ConnectionMethod::HolePunched incorrect for dedup waiters
src/p2p_endpoint.rs:1341-1343
Waiters always report ConnectionMethod::HolePunched { coordinator: target_addr } regardless of how the primary actually connected (could be DirectIPv4, DirectIPv6, or Relay). This gives callers incorrect telemetry about connection establishment.
Fix: Broadcast the actual ConnectionMethod along with the connection result.
H4. poll_closed_connections emits ConnectionLost on every poll tick
src/nat_traversal_api.rs:4866-4892
During the 5-second grace period before removal, ConnectionLost is emitted on every poll tick, not just once. This spams consumers with duplicate events.
Fix: Only emit on the first observation (when the closed_at entry is newly inserted), or track "already-emitted-lost" separately.
H5. remove_connection doesn't clean up closed_at DashMap
src/nat_traversal_api.rs:3777-3787
If a peer reconnects and disconnects again, the stale closed_at timestamp persists from the first disconnection. The new dead connection gets reaped immediately (if the old timestamp was >5s ago) instead of getting a fresh grace period.
Fix: Add self.closed_at.remove(addr) to remove_connection().
H6. Fire-and-forget connection leak
src/high_level/endpoint.rs:733-761
When hole_punch_tx is not configured, the fallback creates a QUIC connection and discards both handles (_ch, _conn). Relies on idle timeout (typically 30s) for cleanup. Under rapid InitiateHolePunch events (attack or busy network), zombie connections accumulate unboundedly.
Fix: Set a short idle timeout on fire-and-forget connections, or track them for explicit cleanup.
H7. Coordinator selection can pick the target itself
src/p2p_endpoint.rs:1395
config.known_peers.first() is used without filtering out the target address. If known_peers[0] happens to be the peer we're trying to reach, it becomes its own NAT traversal coordinator — which is nonsensical and will fail silently.
The connected_peers fallback path (line 1404) correctly filters the target, but the known_peers path does not.
Fix: Filter known_peers against the target address before selecting a coordinator.
H8. bootstrap_nodes grows unboundedly
src/nat_traversal_api.rs:3711-3727
Every add_connection() pushes a BootstrapNode if the address isn't already present. There is no cap and no eviction of stale entries. Over a long-running node's lifetime with many transient peers, this is a slow memory leak.
Note: Greptile's 4 original findings were all resolved in the current commit. This review covers issues not flagged by Greptile.
|
This work will be explored as part of another branch to achieve continuous uploads. |
Summary
Wire the existing NAT traversal protocol (PUNCH_ME_NOW coordination, hole-punching) into the actual connection path so nodes behind NAT can participate in the network.
accept_connectionsbackground task, addaccept_connection_direct()as sole accept path for both Quinn incoming and outgoing hole-punch connectionsNatTraversalEndpointfor full connection registration (DashMap, events, reader tasks) instead of fire-and-forgetremote_address,try_lockforobserved_address,try_sendin reader tasks, removed RwLock writes from send/recv hot pathssend_ack_timeout(1s → 30s) for large chunk transfers over NAT-traversed connectionsTest plan
🤖 Generated with Claude Code
Greptile Summary
This PR wires NAT traversal (PUNCH_ME_NOW coordination, hole-punching, unified accept loop) into the actual connection path. The architectural changes are well-motivated — removing the competing accept task, adding parallel handshakes, dial deduplication, and a background session driver all address real reliability problems. Most of the individual pieces are correctly implemented.
However, one critical flaw blocks the core relay path from working:
DefaultHashercross-process non-determinism (src/endpoint.rs:352,src/nat_traversal_api.rs:2152): Both the requester and the coordinator useDefaultHasherto convert aSocketAddrto a 32-byte wire ID. BecauseDefaultHasheris randomly seeded per-process, the same address produces different bytes on two different machines. The coordinator'sRelayPunchMeNowhandler iterates connections and computeswire_id_from_addr(conn.remote)to find the target — this hash will never match the value the requester computed. The fix is to encode the address bytes directly or use a fixed-seed hash.Additional issues:
src/nat_traversal_api.rs:3669):is_connected()no longer checksis_alive(), andpoll_closed_connectionsno longer removes entries from the DashMap. The reaper inspawn_stale_connection_reaperuses!is_connected()to find dead peers — it will never return stale connections, so connections registered without a reader task (viaspawn_incoming_connection_forwarderor the lazysend()path) leak indefinitely.error!logs left in (src/p2p_endpoint.rs:2812,2819):FORWARDER_DEBUG:messages aterror!level will create false-positive alerts in production log monitoring systems.src/p2p_endpoint.rs:2475):try_sendin the reader task drops data when the channel is full, breaking the application-level reliability guarantee for chunk transfers.Confidence Score: 2/5
DefaultHashercross-process non-determinism inwire_id_from_addrbreaks the coordinator relay lookup entirely. Every relay attempt will fail silently, which is the primary new code path this PR introduces. The stale-reaper regression and debug-levelerror!logs compound the concern. One targeted fix (replaceDefaultHasherwith a stable encoding in bothsrc/endpoint.rsandsrc/nat_traversal_api.rs) would resolve the blocking issue.src/endpoint.rs(wire_id_from_addr),src/nat_traversal_api.rs(wire_id_from_addr + is_connected),src/p2p_endpoint.rs(debug logs + try_send).Important Files Changed
wire_id_from_addrusingDefaultHasherfor relay peer lookup — non-deterministic across processes means the coordinator can never find the target peer, breaking NAT traversal relay entirely.accept_connection_direct;is_connectedno longer checks liveness causing the stale reaper to never trigger; also contains a secondDefaultHasher-basedwire_id_from_addrthat must also be fixed.FORWARDER_DEBUGerror-level logs left in; reader tasktry_sendsilently drops data; forwarder registers connections without spawning reader tasks (no cleanup path).should_resetconditional is slightly redundant.target_peer_idto relay), replaces oldstart_coordination_roundwithInitiateHolePunchendpoint event, adds logging. Changes are targeted and correct.conn.wake()after queuing NAT frames so the QUIC driver flushes them promptly; cachesremote_addressto avoid hot-path mutex acquisition; usestry_lockforobserved_address. All changes are sound.hole_punch_txchannel anddefault_client_configto State; processes hole-punch addresses in the driver loop with proper fallback. Clean, well-structured change.try_lockto both tracking and non-tracking mutex variants — straightforward and correct.DEFAULT_SEND_ACK_TIMEOUTfrom 1s to 30s andFAST_SEND_ACK_TIMEOUTfrom 500ms to 5s to accommodate slow QUIC congestion-window ramp-up on NAT connections; well-justified.target_peer_idencoding/decoding toPunchMeNowframe (1-byte presence flag + 32-byte payload); backward-compatible sinceNoneencodes as a single0x00byte and legacy decoders would error on trailing bytes.dial_addrtoconnect_with_fallbackfor NAT traversal support; correctly uses the actual connected address for subsequent lookups. Reasonable change.InitiateHolePunchvariant and a sender address field toRelayPunchMeNow. Clean extension to the event enum.Sequence Diagram
sequenceDiagram participant A as Peer A (requester) participant C as Coordinator C participant B as Peer B (NATed target) Note over A: connect_with_fallback_inner() A->>C: QUIC connect (already established) Note over A: send_coordination_request()<br/>target_wire_id = wire_id_from_addr(B_addr)<br/>⚠️ uses DefaultHasher (process-local seed) A->>C: PunchMeNow { target_peer_id=wire_id, address=A_addr } Note over C: handle_endpoint_event()<br/>RelayPunchMeNow handler Note over C: for each conn: wire_id_from_addr(conn.remote)<br/>⚠️ different DefaultHasher seed → never matches C--xB: relay FAILS (wire_id mismatch) Note over B: Never receives coordination<br/>Never sends QUIC Initial to A<br/>NAT traversal silently fails alt Happy path (if wire_id fixed) C->>B: relayed PunchMeNow { target_peer_id=None, address=A_addr } Note over B: handle_punch_me_now()<br/>Emits InitiateHolePunch event B->>A: QUIC Initial (creates NAT binding) A->>B: QUIC Initial (simultaneous-open) Note over B: spawn_accept_loop() accepts connection<br/>handshake_tx ← (B_addr, conn) Note over A: accept_connection_direct() returns<br/>P2pEndpoint::accept() registers peer<br/>spawn_reader_task() A->>B: Data (streams open bidirectionally) endComments Outside Diff (1)
src/nat_traversal_api.rs, line 3664-3676 (link)is_connected()now returnstruefor any address present in the DashMap (line 3675), regardless of whether the underlying QUIC connection is alive or closed. Combined with the fact thatpoll_closed_connectionsno longer removes entries from the DashMap (it now only emits events), the DashMap entries for dead connections are never removed.The stale connection reaper in
P2pEndpoint::spawn_stale_connection_reaperidentifies dead connections using:Since
is_connectedalways returnstruefor any key in the DashMap, and keys are never removed,!inner.is_connected(addr)is alwaysfalse. The reaper never fires.The primary cleanup path (reader task →
reader_exit_tx→do_cleanup_connection) still works when a reader task exists. However, connections registered through:spawn_incoming_connection_forwarder(no reader task spawned)send()(lines ~2087–2128, no reader task spawned)…have no cleanup path at all. These connections will accumulate in
connected_peersas permanent zombies, and subsequentsend()calls to those addresses will fail withopen_unierrors indefinitely.At minimum,
is_connectedshould still checkconn.is_alive()so the reaper can detect connections that have entered a closed/draining state, or a new explicit cleanup must be triggered from the registrations that skip the reader-task path.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "fix: wire NAT traversal into connection ..." | Re-trigger Greptile