fix: wire NAT traversal into connection path with unified accept loop #25

Closed
jacderida wants to merge 1 commit into WithAutonomi:main from jacderida:feat-nat_traversal_attempts
@jacderida jacderida commented Mar 24, 2026

Contributor

Summary

Wire the existing NAT traversal protocol (PUNCH_ME_NOW coordination, hole-punching) into the actual connection path so nodes behind NAT can participate in the network.

  • Unified accept path: remove competing accept_connections background task, add accept_connection_direct() as sole accept path for both Quinn incoming and outgoing hole-punch connections
  • Tracked hole-punch connections: forward addresses from Quinn driver via channel to NatTraversalEndpoint for full connection registration (DashMap, events, reader tasks) instead of fire-and-forget
  • Background accept loop with parallel handshakes to prevent serialized blocking on slow NAT connections
  • Stale coordination reset so repeated hole-punches work across client sessions
  • Dial deduplication to prevent concurrent hole-punch attempts to the same target from deadlocking the runtime
  • Contention reduction: cached remote_address, try_lock for observed_address, try_send in reader tasks, removed RwLock writes from send/recv hot paths
  • Increased send_ack_timeout (1s → 30s) for large chunk transfers over NAT-traversed connections

Test plan

  • Tested on 6-node testnet (5 cloud + 1 local behind home router NAT)
  • Hole-punch succeeds end-to-end (client → coordinator → NATed node → NAT binding)
  • Bidirectional communication through NAT (DHT, quotes, chunk transfers)
  • Chunks stored on NATed local node (14MB confirmed)
  • Repeated hole-punches work across client sessions

🤖 Generated with Claude Code

Greptile Summary

This PR wires NAT traversal (PUNCH_ME_NOW coordination, hole-punching, unified accept loop) into the actual connection path. The architectural changes are well-motivated — removing the competing accept task, adding parallel handshakes, dial deduplication, and a background session driver all address real reliability problems. Most of the individual pieces are correctly implemented.

However, one critical flaw blocks the core relay path from working:

  • DefaultHasher cross-process non-determinism (src/endpoint.rs:352, src/nat_traversal_api.rs:2152): Both the requester and the coordinator use DefaultHasher to convert a SocketAddr to a 32-byte wire ID. Because DefaultHasher is randomly seeded per-process, the same address produces different bytes on two different machines. The coordinator's RelayPunchMeNow handler iterates connections and computes wire_id_from_addr(conn.remote) to find the target — this hash will never match the value the requester computed. The fix is to encode the address bytes directly or use a fixed-seed hash.

Additional issues:

  • Stale connection reaper disabled (src/nat_traversal_api.rs:3669): is_connected() no longer checks is_alive(), and poll_closed_connections no longer removes entries from the DashMap. The reaper in spawn_stale_connection_reaper uses !is_connected() to find dead peers — it will never return stale connections, so connections registered without a reader task (via spawn_incoming_connection_forwarder or the lazy send() path) leak indefinitely.
  • Debug error! logs left in (src/p2p_endpoint.rs:2812, 2819): FORWARDER_DEBUG: messages at error! level will create false-positive alerts in production log monitoring systems.
  • Silent data loss (src/p2p_endpoint.rs:2475): try_send in the reader task drops data when the channel is full, breaking the application-level reliability guarantee for chunk transfers.

Confidence Score: 2/5

  • Not safe to merge — the relay lookup is fundamentally broken due to non-deterministic cross-process hashing, meaning NAT traversal will silently fail in the most common case (requester ≠ coordinator).
  • The architectural direction is sound and well-tested on the author's testnet, but the DefaultHasher cross-process non-determinism in wire_id_from_addr breaks the coordinator relay lookup entirely. Every relay attempt will fail silently, which is the primary new code path this PR introduces. The stale-reaper regression and debug-level error! logs compound the concern. One targeted fix (replace DefaultHasher with a stable encoding in both src/endpoint.rs and src/nat_traversal_api.rs) would resolve the blocking issue.
  • src/endpoint.rs (wire_id_from_addr), src/nat_traversal_api.rs (wire_id_from_addr + is_connected), src/p2p_endpoint.rs (debug logs + try_send).

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/endpoint.rs | Adds wire_id_from_addr using DefaultHasher for relay peer lookup — non-deterministic across processes means the coordinator can never find the target peer, breaking NAT traversal relay entirely. |
| src/nat_traversal_api.rs | Adds unified accept loop, hole-punch tracking channels, and accept_connection_direct; is_connected no longer checks liveness causing the stale reaper to never trigger; also contains a second DefaultHasher-based wire_id_from_addr that must also be fixed. |
| src/p2p_endpoint.rs | Adds dial deduplication, session driver, incoming connection forwarder, and post-hole-punch direct-connect retry; FORWARDER_DEBUG error-level logs left in; reader task try_send silently drops data; forwarder registers connections without spawning reader tasks (no cleanup path). |
| src/connection/nat_traversal.rs | Adds stale coordination reset and bootstraps a new coordination round when receiving a relayed PUNCH_ME_NOW with no active round; logic is correct, though the should_reset conditional is slightly redundant. |
| src/connection/mod.rs | Fixes coordinator-only path (requires target_peer_id to relay), replaces old start_coordination_round with InitiateHolePunch endpoint event, adds logging. Changes are targeted and correct. |
| src/high_level/connection.rs | Adds conn.wake() after queuing NAT frames so the QUIC driver flushes them promptly; caches remote_address to avoid hot-path mutex acquisition; uses try_lock for observed_address. All changes are sound. |
| src/high_level/endpoint.rs | Adds hole_punch_tx channel and default_client_config to State; processes hole-punch addresses in the driver loop with proper fallback. Clean, well-structured change. |
| src/high_level/mutex.rs | Adds try_lock to both tracking and non-tracking mutex variants — straightforward and correct. |
| src/config/nat_timeouts.rs | Increases DEFAULT_SEND_ACK_TIMEOUT from 1s to 30s and FAST_SEND_ACK_TIMEOUT from 500ms to 5s to accommodate slow QUIC congestion-window ramp-up on NAT connections; well-justified. |
| src/frame/nat_traversal_unified.rs | Adds optional target_peer_id encoding/decoding to PunchMeNow frame (1-byte presence flag + 32-byte payload); backward-compatible since None encodes as a single 0x00 byte and legacy decoders would error on trailing bytes. |
| src/link_transport_impl.rs | Switches dial_addr to connect_with_fallback for NAT traversal support; correctly uses the actual connected address for subsequent lookups. Reasonable change. |
| src/shared.rs | Adds InitiateHolePunch variant and a sender address field to RelayPunchMeNow. Clean extension to the event enum. |

Sequence Diagram

sequenceDiagram
    participant A as Peer A (requester)
    participant C as Coordinator C
    participant B as Peer B (NATed target)

    Note over A: connect_with_fallback_inner()
    A->>C: QUIC connect (already established)
    Note over A: send_coordination_request()<br/>target_wire_id = wire_id_from_addr(B_addr)<br/>⚠️ uses DefaultHasher (process-local seed)
    A->>C: PunchMeNow { target_peer_id=wire_id, address=A_addr }

    Note over C: handle_endpoint_event()<br/>RelayPunchMeNow handler
    Note over C: for each conn: wire_id_from_addr(conn.remote)<br/>⚠️ different DefaultHasher seed → never matches
    C--xB: relay FAILS (wire_id mismatch)

    Note over B: Never receives coordination<br/>Never sends QUIC Initial to A<br/>NAT traversal silently fails

    alt Happy path (if wire_id fixed)
        C->>B: relayed PunchMeNow { target_peer_id=None, address=A_addr }
        Note over B: handle_punch_me_now()<br/>Emits InitiateHolePunch event
        B->>A: QUIC Initial (creates NAT binding)
        A->>B: QUIC Initial (simultaneous-open)
        Note over B: spawn_accept_loop() accepts connection<br/>handshake_tx ← (B_addr, conn)
        Note over A: accept_connection_direct() returns<br/>P2pEndpoint::accept() registers peer<br/>spawn_reader_task()
        A->>B: Data (streams open bidirectionally)
    end

Comments Outside Diff (1)

  1. src/nat_traversal_api.rs, lines 3664-3676

    P1 Stale connection reaper is permanently disabled by this change

    is_connected() now returns true for any address present in the DashMap (line 3675), regardless of whether the underlying QUIC connection is alive or closed. Combined with the fact that poll_closed_connections no longer removes entries from the DashMap (it now only emits events), the DashMap entries for dead connections are never removed.

    The stale connection reaper in P2pEndpoint::spawn_stale_connection_reaper identifies dead connections using:

    .filter(|addr| !inner.is_connected(addr))

    Since is_connected always returns true for any key in the DashMap, and keys are never removed, !inner.is_connected(addr) is always false. The reaper never fires.

    The primary cleanup path (reader task → reader_exit_tx → do_cleanup_connection) still works when a reader task exists. However, connections registered through:

    • spawn_incoming_connection_forwarder (no reader task spawned)
    • The lazy registration branch inside send() (lines ~2087–2128, no reader task spawned)

    …have no cleanup path at all. These connections will accumulate in connected_peers as permanent zombies, and subsequent send() calls to those addresses will fail with open_uni errors indefinitely.

    At minimum, is_connected should still check conn.is_alive() so the reaper can detect connections that have entered a closed/draining state, or a new explicit cleanup must be triggered from the registrations that skip the reader-task path.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/endpoint.rs
Line: 349-362

Comment:
**`DefaultHasher` is non-deterministic across processes — relay lookup always fails**

`DefaultHasher::new()` uses a randomly seeded internal state that is initialized once per process. Two different processes (the requesting peer computing `target_peer_id` via `NatTraversalEndpoint::wire_id_from_addr`, and the coordinator calling this function to match connections) will produce completely different hashes for the same `SocketAddr`.

The PUNCH_ME_NOW relay path is:
1. Requester (Peer A): calls `NatTraversalEndpoint::wire_id_from_addr(target_addr)` → sends as `target_peer_id` in frame to Coordinator C
2. Coordinator (Peer C): receives frame, calls `endpoint.rs::wire_id_from_addr(conn.remote)` for each connection to find a match

Since A and C are different processes, their `DefaultHasher` seeds differ. The comparison at line 701 (`wire_id == target_peer_id`) will **never** succeed, meaning the relay lookup will always silently fail and NAT traversal coordination will not work.

Both this function and `NatTraversalEndpoint::wire_id_from_addr` must be replaced with a deterministic hash that is stable across processes. Use a fixed-seed algorithm such as SipHash-1-3 with a hard-coded key, or simply encode the raw IP bytes directly into the 32-byte array without hashing:

```rust
fn wire_id_from_addr(addr: SocketAddr) -> [u8; 32] {
    let mut bytes = [0u8; 32];
    match addr {
        SocketAddr::V4(v4) => {
            bytes[0] = 4;
            bytes[1..5].copy_from_slice(&v4.ip().octets());
            bytes[5..7].copy_from_slice(&v4.port().to_be_bytes());
        }
        SocketAddr::V6(v6) => {
            bytes[0] = 6;
            bytes[1..17].copy_from_slice(&v6.ip().octets());
            bytes[17..19].copy_from_slice(&v6.port().to_be_bytes());
        }
    }
    bytes
}
```

The same fix must be applied to `NatTraversalEndpoint::wire_id_from_addr` in `src/nat_traversal_api.rs:2152`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/p2p_endpoint.rs
Line: 2812-2819

Comment:
**Debug `error!` level logs left in production code**

Two `tracing::error!` calls tagged `FORWARDER_DEBUG:` are left in this shipping code. Using `error!` for debug traces will flood operator log sinks with false-positive alerts (PagerDuty, Datadog, etc.) on every `P2pEndpoint` construction, which happens at startup for every node.

```suggestion
        debug!("spawn_incoming_connection_forwarder called");
```

And on line 2819:
```suggestion
            debug!("Incoming connection forwarder: started, acquiring rx lock...");
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/p2p_endpoint.rs
Line: 2475-2482

Comment:
**`try_send` silently drops data under backpressure**

The switch from `send().await` to `try_send` trades one failure mode (deadlock) for another (silent data loss). When the bounded data channel is full, `TrySendError::Full` is logged and the received bytes are discarded. For the chunk-transfer use case described in the PR (14 MB chunks over NAT), a momentary burst of incoming streams can easily saturate the channel, causing the application layer to silently miss messages.

QUIC guarantees ordered, reliable delivery at the transport layer — receiving a partial sequence of application messages breaks that guarantee from the caller's perspective.

A safer alternative is to spawn a short-lived task per message that blocks on `send().await`, capped by a per-peer semaphore so the number of in-flight tasks is bounded without dropping data:

```rust
// Run the send on its own task so a full channel does not block the reader.
// The per-peer semaphore is omitted here, and separate tasks do not by
// themselves guarantee delivery order.
let tx = data_tx.clone();
tokio::spawn(async move {
    let _ = tx.send((addr, data)).await;
});
```

If dropping is intentional (UDP-like semantics), the log level should be `warn!` and the caller must be documented as best-effort.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: wire NAT traversal into connection ..."

Greptile also left 3 inline comments on this PR.

@jacderida jacderida force-pushed the feat-nat_traversal_attempts branch from ee6cf6f to 688c9ca Compare March 24, 2026 01:05

@mickvandijke mickvandijke left a comment

Collaborator

Deep Review: PR #25 — Wire NAT Traversal into Connection Path

+1070 / -212 across 13 files. The architectural direction is sound and the testnet validation is encouraging, but several issues need addressing before merge.

Note on Greptile's DefaultHasher claim: Greptile rated this 2/5 primarily because DefaultHasher is "non-deterministic across processes." That specific claim is a false positive: DefaultHasher::new() uses SipHash with fixed keys and IS deterministic today. The relay path works. However, it's fragile (see High #4 below).


CRITICAL

1. send_ack_timeout increase from 1s → 30s masks connection failures

Files: src/config/nat_timeouts.rs:141,144

DEFAULT_SEND_ACK_TIMEOUT went from 1s to 30s (30× increase), FAST_SEND_ACK_TIMEOUT from 500ms to 5s (10×). While the reasoning for large NAT chunk transfers is sound, this is a global default that affects all connections — not just NAT-traversed ones. Small control messages will now wait 30s before detecting a dead connection. The doc on TimeoutConfig::send_ack_timeout says "this must be shorter than any outer send timeout" — 30s likely exceeds many callers' timeouts, causing cascading failures.

Fix: Make the timeout adaptive (base + per-byte rate), or expose separate timeouts for control vs. bulk transfers, or scope the increase to NAT-traversed connections only.
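
As a rough sketch of the adaptive option (a hypothetical helper, not code from this PR), the timeout can be the larger of the configured base and a budget derived from payload size. The name `adaptive_send_ack_timeout` and the 100 kB/s throughput floor are illustrative assumptions.

```rust
use std::time::Duration;

/// Assumed worst-case throughput used to scale the timeout (bytes per second).
/// This figure is illustrative, not taken from the crate's configuration.
const ASSUMED_MIN_BYTES_PER_SEC: u64 = 100_000;

/// Returns the base timeout for small payloads and a size-proportional
/// timeout for payloads that would need longer than the base to transfer.
fn adaptive_send_ack_timeout(base: Duration, payload_len: usize) -> Duration {
    let transfer_budget =
        Duration::from_secs(payload_len as u64 / ASSUMED_MIN_BYTES_PER_SEC);
    base.max(transfer_budget)
}
```

With this shape, a small control message keeps the short base timeout, so dead connections are still detected quickly, while a multi-megabyte chunk gets proportionally more time.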


2. Stale Connection Reaper Permanently Disabled — Unbounded DashMap Growth

Files: src/nat_traversal_api.rs:3669-3676, src/p2p_endpoint.rs:2733-2740

is_connected() was changed from checking conn.is_alive() (with dead-connection removal) to a bare self.connections.contains_key(addr). Meanwhile, poll_closed_connections() no longer removes dead connections from the DashMap. The stale connection reaper in P2pEndpoint filters on !inner.is_connected(addr), which is now false for every tracked address, so the filter always yields an empty set.

Impact: Dead connections accumulate forever. broadcast_address_to_peers() sends to dead connections. check_connections_for_observed_addresses scans dead entries every 500ms. The docstring still claims "removes it from the connection table and returns false" — stale and misleading.

Fix: Restore is_alive() in is_connected(), or have poll_closed_connections remove entries after emitting the event (with a grace period for hole-punched connections).


3. event_rx Has Competing Consumers — Events Silently Lost

Files: src/nat_traversal_api.rs:3579-3594, 4851-4855

Three independent code paths consume from the same mpsc::UnboundedReceiver:

  • spawn_accept_loop — drains ConnectionEstablished events
  • drain_pending_events (called from poll()) — drains all events
  • accept_connection (old method, still called from connection_router.rs:1299)

Events consumed by one reader are lost to the others. Additionally, spawn_accept_loop uses while let Ok(NatTraversalEvent::ConnectionEstablished { .. }) = erx.try_recv() — a refutable pattern that silently drops any non-ConnectionEstablished event it dequeues.

Fix: Use a broadcast channel, have a single drain point that routes events, or remove the old accept_connection() path and update ConnectionRouter.
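
The refutable-pattern problem can be reproduced with a plain `std::sync::mpsc` channel. The sketch below uses a simplified, hypothetical `Event` enum (not the crate's `NatTraversalEvent`) to contrast the lossy loop with a single drain point that routes every dequeued event.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Simplified stand-in for the real event type (hypothetical).
#[derive(Debug, PartialEq)]
enum Event {
    ConnectionEstablished(u32),
    ConnectionLost(u32),
}

/// Mirrors the problematic pattern: the loop stops at the first event that is
/// not `ConnectionEstablished`, but that event has already been dequeued.
fn drain_established_lossy(rx: &Receiver<Event>) -> Vec<u32> {
    let mut established = Vec::new();
    while let Ok(Event::ConnectionEstablished(id)) = rx.try_recv() {
        established.push(id);
    }
    established
}

/// Single drain point: every dequeued event is routed to one of two buckets.
fn drain_routed(rx: &Receiver<Event>) -> (Vec<u32>, Vec<Event>) {
    let mut established = Vec::new();
    let mut other = Vec::new();
    while let Ok(event) = rx.try_recv() {
        match event {
            Event::ConnectionEstablished(id) => established.push(id),
            e => other.push(e),
        }
    }
    (established, other)
}
```

In the lossy variant, a `ConnectionLost` that happens to sit between two `ConnectionEstablished` events is consumed and discarded, and no other consumer of the receiver will ever see it.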


HIGH

4. wire_id_from_addr Uses DefaultHasher — Fragile Wire Protocol

Files: src/endpoint.rs:352-362, src/nat_traversal_api.rs:2152-2164

DefaultHasher::new() is deterministic today (fixed SipHash keys), but Rust does not guarantee stability across compiler versions. A mixed-version deployment would silently break relay. Additionally:

  • Two identical copies must stay in sync (only a doc comment enforces this)
  • 32-byte output has only 64 bits of entropy (same u64 repeated 4×)

Fix: Extract into a shared function. Replace with BLAKE3 (already a dependency): blake3::hash(addr.to_string().as_bytes()).
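
To make the entropy point concrete, here is a hypothetical reconstruction of the scheme described above (one 64-bit `DefaultHasher` output repeated to fill 32 bytes). The function name `hashed_wire_id` and the big-endian byte order are assumptions; the PR's actual code may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;

/// Assumed shape of the hashed wire ID: a single u64 hash copied four times.
fn hashed_wire_id(addr: SocketAddr) -> [u8; 32] {
    let mut hasher = DefaultHasher::new();
    addr.hash(&mut hasher);
    let h = hasher.finish().to_be_bytes();
    let mut out = [0u8; 32];
    for chunk in out.chunks_exact_mut(8) {
        chunk.copy_from_slice(&h);
    }
    out
}

/// Counts how many distinct 8-byte groups the ID contains.
fn distinct_groups(id: &[u8; 32]) -> usize {
    let mut groups: Vec<&[u8]> = id.chunks_exact(8).collect();
    groups.sort();
    groups.dedup();
    groups.len()
}
```

Two hashers created through `DefaultHasher::new()` agree with each other, which is why the relay lookup works today. The standard library does not promise that the algorithm stays the same across Rust releases, and `distinct_groups` returning 1 shows the ID carries no more information than one `u64`.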


MEDIUM

5. try_send in Reader Task Silently Drops Data

File: src/p2p_endpoint.rs:2475-2482

The switch from send().await to try_send means when the bounded data channel is full, received bytes are logged at warn! and discarded. QUIC guarantees reliable delivery at transport layer — silently dropping application messages breaks that contract. For 14MB chunk transfers over NAT, bursts can easily saturate the channel.

Fix: Spawn a short-lived task per message that blocks on send().await (bounded by a per-peer semaphore), increase channel capacity, or use tokio::select! with a timeout.
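
The drop behaviour is easy to see with the standard library's bounded channel, used here as a stand-in for the async channel in the reader task (an assumption; the real code uses an async runtime). The helper name is hypothetical.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes `messages` payloads into a bounded channel of `capacity` using
/// `try_send` while no consumer is draining it, and returns how many were
/// rejected with `Full`. `capacity` should be at least 1.
fn count_dropped_by_try_send(capacity: usize, messages: usize) -> usize {
    // Keep the receiver alive so the channel is full rather than disconnected.
    let (tx, _rx) = sync_channel::<Vec<u8>>(capacity);
    let mut dropped = 0;
    for i in 0..messages {
        if let Err(TrySendError::Full(_payload)) = tx.try_send(vec![i as u8]) {
            // The payload comes back inside the error; ignoring it loses the data.
            dropped += 1;
        }
    }
    dropped
}
```

With a capacity of 2 and a burst of 5 messages, 3 are rejected. A blocking or awaited `send` would instead wait for the consumer, which is the trade-off between backpressure and loss that this item is about.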


6. cached_remote_addr Stale After Connection Migration

Files: src/high_level/connection.rs:667-668, 1124-1126

remote_address() now returns a cached value set once at construction. After migration, DashMap lookups, data routing, and relay lookups all use a stale address.

Fix: Rename to initial_remote_address() and document, or update the cache on migration events.


7. Fire-and-Forget Fallback Creates Orphaned Connections

File: src/high_level/endpoint.rs:733-758

When hole_punch_tx is None, the fallback calls self.inner.connect(...) but discards both _ch and _conn. No driver is spawned, no cleanup exists — connection state leaks in the low-level Endpoint forever.

Fix: Properly register the connection, or don't create a full QUIC connection in the fallback path.


8. decode_rfc Corrupts Stream on Partial target_peer_id

File: src/frame/nat_traversal_unified.rs:369-380

When has_peer_id == 1 but r.remaining() < 32, the 1-byte flag is already consumed, leaving the stream position off by 1. Subsequent frame parsing will be corrupted. Non-zero non-one values are silently accepted.

Fix: Return Err(UnexpectedEnd) when fewer than 32 bytes remain. Validate has_peer_id is 0 or 1.
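
A minimal sketch of the stricter decoding, written against a plain byte slice rather than the crate's buffer type. `DecodeError`, `InvalidFlag`, and `decode_target_peer_id` are hypothetical names for illustration.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    UnexpectedEnd,
    InvalidFlag(u8),
}

/// Parses the optional trailing field (1-byte presence flag, then 32 bytes)
/// from `input`. Returns the parsed value and the number of bytes used, so
/// the caller only advances its cursor after a successful parse.
fn decode_target_peer_id(input: &[u8]) -> Result<(Option<[u8; 32]>, usize), DecodeError> {
    let (&flag, rest) = input.split_first().ok_or(DecodeError::UnexpectedEnd)?;
    match flag {
        0 => Ok((None, 1)),
        1 => {
            let bytes = rest.get(..32).ok_or(DecodeError::UnexpectedEnd)?;
            let mut id = [0u8; 32];
            id.copy_from_slice(bytes);
            Ok((Some(id), 33))
        }
        other => Err(DecodeError::InvalidFlag(other)),
    }
}
```

Because the function never mutates a shared cursor, a truncated or malformed field leaves the caller's read position untouched.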


9. observed_address() Silently Returns None Under Lock Contention

File: src/high_level/connection.rs:681-688

try_lock returns None during contention, indistinguishable from "no OBSERVED_ADDRESS received." Address discovery may be delayed or miss observations entirely.

Fix: Cache observed address in an AtomicCell outside the connection mutex, or return a tri-state to distinguish contention from absence.
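
The tri-state option could look roughly like this. It uses `std::sync::Mutex` as a stand-in for the crate's own mutex wrapper, and the `Observed` enum is a hypothetical name.

```rust
use std::net::SocketAddr;
use std::sync::{Mutex, TryLockError};

/// Distinguishes "nothing observed yet" from "could not look right now".
#[derive(Debug, PartialEq)]
enum Observed {
    Address(SocketAddr),
    NotYetReceived,
    Contended,
}

fn observed_address(state: &Mutex<Option<SocketAddr>>) -> Observed {
    match state.try_lock() {
        Ok(guard) => match *guard {
            Some(addr) => Observed::Address(addr),
            None => Observed::NotYetReceived,
        },
        Err(TryLockError::WouldBlock) => Observed::Contended,
        // A poisoned lock still holds data; read it rather than panic.
        Err(TryLockError::Poisoned(poisoned)) => match *poisoned.into_inner() {
            Some(addr) => Observed::Address(addr),
            None => Observed::NotYetReceived,
        },
    }
}
```

Callers that poll for an observed address can then retry on `Contended` instead of concluding that no observation exists.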


10. Debug error! Logs Left in Production Code

File: src/p2p_endpoint.rs:2812,2819

Two tracing::error!("FORWARDER_DEBUG: ...") calls at error level will trigger production alerts. Trivial fix — change to debug!.

@jacderida jacderida force-pushed the feat-nat_traversal_attempts branch 2 times, most recently from 211e396 to 758f16c Compare March 24, 2026 16:03
@jacderida

Contributor Author

Thanks for the thorough review @mickvandijke. All 10 issues addressed in the latest force-push:

CRITICAL

1. send_ack_timeout increase — Restored defaults to 5s/2s. The send path now computes an adaptive timeout: max(config_timeout, data.len() / 100_000 seconds). Small control messages use the 5s default; large chunk transfers get proportionally more time.

2. Stale connection reaper — Fixed. is_connected() now checks close_reason() and removes dead connections. poll_closed_connections() removes dead connections after a 5-second grace period (tracked via a closed_at DashMap) to avoid racing with hole-punch setup.

3. event_rx competing consumers — Removed the event_rx drain from spawn_accept_loop entirely. The accept loop now relies on incoming_notify and scans the connections DashMap directly for newly-emitted addresses. Only poll() and accept_connection_direct() consume from event_rx, and accept_connection_direct only does a single try_recv per call.

HIGH

4. wire_id_from_addr — Replaced both copies with a deterministic byte encoding (version byte + raw IP octets + port, zero-padded to 32 bytes) in a shared function at src/shared.rs. No hashing involved. Both endpoint.rs and nat_traversal_api.rs delegate to crate::shared::wire_id_from_addr.

MEDIUM

5. try_send data loss — Instead of dropping data on a full channel, the reader task now spawns a bounded task that retries send().await with a 5-second timeout. Data is only dropped if the timeout expires.

6. cached_remote_addr — Renamed to initial_remote_addr. Added doc comment to remote_address() noting it returns the address at connection creation time and may not reflect connection migration.

7. Fire-and-forget fallback — Added comment explaining the intentional discard is for backward compatibility when hole_punch_tx is not configured. Quinn's internal idle timeout handles cleanup.

8. decode_rfc partial target_peer_id — Now returns Err(UnexpectedEnd) when has_peer_id == 1 but fewer than 32 bytes remain. Invalid flag values (not 0 or 1) are also rejected.

9. observed_address() try_lock — Added doc comment explaining that None may indicate lock contention rather than absence of observed address data.

10. Debug error! logs — Changed both FORWARDER_DEBUG calls to debug!().

@jacderida jacderida force-pushed the feat-nat_traversal_attempts branch 2 times, most recently from 82d872f to 56fdb2b Compare March 24, 2026 16:20
Wire the existing NAT traversal protocol (PUNCH_ME_NOW coordination,
hole-punching) into the actual connection path so nodes behind NAT can
participate in the network.

Key changes:
- Unified accept path: remove competing accept_connections background
  task, add accept_connection_direct() as sole accept path for both
  Quinn incoming and outgoing hole-punch connections
- Tracked hole-punch connections: forward addresses from Quinn driver
  via channel to NatTraversalEndpoint for full connection registration
  (DashMap, events, reader tasks) instead of fire-and-forget
- Background accept loop with parallel handshakes to prevent
  serialized blocking on slow NAT connections
- Stale coordination reset so repeated hole-punches work across
  client sessions
- Dial deduplication to prevent concurrent hole-punch attempts to the
  same target from deadlocking the runtime
- Contention reduction: cached remote_address, try_lock for
  observed_address, try_send in reader tasks, removed RwLock writes
  from send/recv hot paths
- Increased send_ack_timeout (1s to 30s) for large chunk transfers
  over NAT-traversed connections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the feat-nat_traversal_attempts branch from 56fdb2b to ebe824b Compare March 24, 2026 16:28

@mickvandijke mickvandijke left a comment

Collaborator

Deep Review — Critical & High Issues

Greptile Triage

Greptile's 4 findings (DefaultHasher non-determinism, stale reaper disabled, error! logs, try_send data loss) were all addressed before the PR was pushed. The Greptile review appears to have run against an earlier commit. The wire_id_from_addr in shared.rs:261 now uses deterministic byte packing, is_connected() checks close_reason(), and the debug logs use debug! level.


CRITICAL

C1. decode_auto fallback corrupts buffer after partial RFC decode

src/frame/nat_traversal_unified.rs:393-401

decode_auto tries decode_rfc first. If decode_rfc fails partway through (after consuming round + seq + address bytes), the Buf cursor has already advanced. decode_legacy then starts from the wrong position, interpreting leftover bytes as a new frame.

A malicious peer can craft a frame with a valid RFC prefix but has_peer_id = 2 (line 381) to trigger this path, causing decode_legacy to parse attacker-controlled data as a PunchMeNow frame. This is exploitable in a P2P protocol where any peer can send arbitrary frames.

Fix: Either save/restore the buffer position before attempting RFC decode, or don't fall back to legacy after partial RFC consumption.
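
One way to express the save/restore idea is to have each parser work on an immutable view and report how many bytes it used, so the shared cursor only moves on success. This is a generic sketch over byte slices; `decode_with_fallback` and the parser signature are assumptions, not the crate's API.

```rust
/// Tries `primary`, then `fallback`, on the same starting bytes. Each parser
/// returns the decoded value and the number of bytes it used. The cursor is
/// advanced only after one of them succeeds, so a failed primary attempt
/// cannot leave the fallback starting from a shifted position.
fn decode_with_fallback<T>(
    cursor: &mut &[u8],
    primary: impl Fn(&[u8]) -> Option<(T, usize)>,
    fallback: impl Fn(&[u8]) -> Option<(T, usize)>,
) -> Option<T> {
    let start: &[u8] = *cursor; // snapshot of the unread bytes
    let (value, used) = primary(start).or_else(|| fallback(start))?;
    *cursor = start.get(used..)?; // commit; rejects an out-of-range length
    Some(value)
}
```

If both parsers fail, the cursor is left exactly where it was, and the caller can report a malformed frame without having consumed attacker-controlled bytes.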


C2. spawn_incoming_connection_forwarder registers connections without reader tasks

src/p2p_endpoint.rs:2857-2878

Registers peers in connected_peers (line 2873) and emits PeerConnected (line 2874) but never spawns a reader task. The forwarder only receives a SocketAddr from the channel — it has no Connection handle. Consequences:

  • recv() never delivers data for these connections
  • Event-driven cleanup (reader-exit handler) never fires
  • Only the stale reaper can clean up, on a 10s interval

This same class of issue affects two more code paths:

  • try_hole_punch (line 1858-1869) — registers in connected_peers without reader task
  • send() lazy registration (lines ~2100-2133) — same problem

All three paths create "zombie" entries that appear connected but silently drop inbound data.


C3. Competing accept() consumers — race condition

src/nat_traversal_api.rs — lines 1542 vs 2716

Both spawn_accept_loop() (spawned unconditionally in new()) and accept_connections() (spawnable via start_listening()) call endpoint.accept() on the same InnerEndpoint. If both run, incoming connections are non-deterministically split between two code paths with different registration logic. The shared emitted_established_events DashSet further complicates this — if one path inserts an address first, the other path's event emission is suppressed.

Fix: Either remove accept_connections/start_listening entirely (since spawn_accept_loop replaces it), or guard against double-spawning with an AtomicBool.
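
The `AtomicBool` guard could be as small as the following sketch. `AcceptLoopGuard` is a hypothetical type; in the real code the flag would presumably live next to the endpoint that both spawn sites share.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Ensures only one caller gets to spawn the accept loop.
struct AcceptLoopGuard {
    started: AtomicBool,
}

impl AcceptLoopGuard {
    const fn new() -> Self {
        Self { started: AtomicBool::new(false) }
    }

    /// Returns `true` exactly once; later callers get `false` and should
    /// skip spawning their own accept task.
    fn try_claim(&self) -> bool {
        self.started
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

Both `spawn_accept_loop()` and `accept_connections()` would call `try_claim()` first and return early on `false`.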


HIGH

H1. Dial dedup key only uses first non-None address

src/p2p_endpoint.rs:1326

let target = target_ipv4.or(target_ipv6);

If caller A dials (Some(ipv4), None) and caller B dials (None, Some(ipv6)) to the same peer, they won't be deduplicated — both start parallel hole-punch sessions, which is exactly what dedup was designed to prevent.


H2. Thundering-herd retry after failed primary dial

src/p2p_endpoint.rs:1346-1349, 1366

When the primary dial fails, the pending_dials entry is removed (line 1366) and all waiters receive the error. They all fall through to retry simultaneously with no guard — re-creating the concurrent dial storm that dedup was meant to prevent.

Fix: Either re-insert the pending_dials entry so only one waiter retries, or add exponential backoff with jitter for waiters.


H3. ConnectionMethod::HolePunched incorrect for dedup waiters

src/p2p_endpoint.rs:1341-1343

Waiters always report ConnectionMethod::HolePunched { coordinator: target_addr } regardless of how the primary actually connected (could be DirectIPv4, DirectIPv6, or Relay). This gives callers incorrect telemetry about connection establishment.

Fix: Broadcast the actual ConnectionMethod along with the connection result.


H4. poll_closed_connections emits ConnectionLost on every poll tick

src/nat_traversal_api.rs:4866-4892

During the 5-second grace period before removal, ConnectionLost is emitted on every poll tick, not just once. This spams consumers with duplicate events.

Fix: Only emit on the first observation (when the closed_at entry is newly inserted), or track "already-emitted-lost" separately.


H5. remove_connection doesn't clean up closed_at DashMap

src/nat_traversal_api.rs:3777-3787

If a peer reconnects and disconnects again, the stale closed_at timestamp persists from the first disconnection. The new dead connection gets reaped immediately (if the old timestamp was >5s ago) instead of getting a fresh grace period.

Fix: Add self.closed_at.remove(addr) to remove_connection().
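
H4 and H5 can be addressed together by treating the `closed_at` entry as the "already emitted" marker and clearing it on removal. The sketch below uses a plain `HashMap` and an explicit `now` parameter for testability; `ClosedTracker` and `Outcome` are hypothetical names, and the 5-second grace period is the one described in this thread.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

const GRACE: Duration = Duration::from_secs(5);

#[derive(Debug, PartialEq)]
enum Outcome {
    /// First time the connection is seen closed: emit `ConnectionLost` once.
    EmitLost,
    /// Still inside the grace period: stay quiet.
    Waiting,
    /// Grace period elapsed: remove the connection entry.
    Reap,
}

#[derive(Default)]
struct ClosedTracker {
    closed_at: HashMap<SocketAddr, Instant>,
}

impl ClosedTracker {
    /// Called on each poll tick for a connection that reports as closed.
    fn observe_closed(&mut self, addr: SocketAddr, now: Instant) -> Outcome {
        match self.closed_at.get(&addr).copied() {
            None => {
                self.closed_at.insert(addr, now);
                Outcome::EmitLost
            }
            Some(first_seen) if now.duration_since(first_seen) >= GRACE => {
                self.closed_at.remove(&addr);
                Outcome::Reap
            }
            Some(_) => Outcome::Waiting,
        }
    }

    /// Clears the timestamp so a later reconnect gets a fresh grace period.
    fn remove_connection(&mut self, addr: &SocketAddr) {
        self.closed_at.remove(addr);
    }
}
```

Repeated polls inside the grace period return `Waiting`, so consumers see a single `ConnectionLost` per disconnection.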


H6. Fire-and-forget connection leak

src/high_level/endpoint.rs:733-761

When hole_punch_tx is not configured, the fallback creates a QUIC connection and discards both handles (_ch, _conn). Relies on idle timeout (typically 30s) for cleanup. Under rapid InitiateHolePunch events (attack or busy network), zombie connections accumulate unboundedly.

Fix: Set a short idle timeout on fire-and-forget connections, or track them for explicit cleanup.


H7. Coordinator selection can pick the target itself

src/p2p_endpoint.rs:1395

config.known_peers.first() is used without filtering out the target address. If known_peers[0] happens to be the peer we're trying to reach, it becomes its own NAT traversal coordinator — which is nonsensical and will fail silently.

The connected_peers fallback path (line 1404) correctly filters the target, but the known_peers path does not.

Fix: Filter known_peers against the target address before selecting a coordinator.
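
A minimal sketch of the filtered selection, assuming `known_peers` is a list of socket addresses (the real config type may carry more than an address). `pick_coordinator` is a hypothetical helper name.

```rust
use std::net::SocketAddr;

/// Picks the first known peer that is not the dial target, so a peer is
/// never asked to coordinate a hole-punch to itself.
fn pick_coordinator(known_peers: &[SocketAddr], target: SocketAddr) -> Option<SocketAddr> {
    known_peers.iter().copied().find(|peer| *peer != target)
}
```

Returning `None` when the target is the only known peer lets the caller fall through to the `connected_peers` path instead of failing silently.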


H8. bootstrap_nodes grows unboundedly

src/nat_traversal_api.rs:3711-3727

Every add_connection() pushes a BootstrapNode if the address isn't already present. There is no cap and no eviction of stale entries. Over a long-running node's lifetime with many transient peers, this is a slow memory leak.


Note: Greptile's 4 original findings were all resolved in the current commit. This review covers issues not flagged by Greptile.

@jacderida

Contributor Author

This work will be explored as part of another branch to achieve continuous uploads.
