fix: wire NAT traversal into connection path with unified accept loop by jacderida · Pull Request #25 · WithAutonomi/saorsa-transport

jacderida · 2026-03-24T00:53:21Z

Summary

Wire the existing NAT traversal protocol (PUNCH_ME_NOW coordination, hole-punching) into the actual connection path so nodes behind NAT can participate in the network.

Unified accept path: remove competing accept_connections background task, add accept_connection_direct() as sole accept path for both Quinn incoming and outgoing hole-punch connections
Tracked hole-punch connections: forward addresses from Quinn driver via channel to NatTraversalEndpoint for full connection registration (DashMap, events, reader tasks) instead of fire-and-forget
Background accept loop with parallel handshakes to prevent serialized blocking on slow NAT connections
Stale coordination reset so repeated hole-punches work across client sessions
Dial deduplication to prevent concurrent hole-punch attempts to the same target from deadlocking the runtime
Contention reduction: cached remote_address, try_lock for observed_address, try_send in reader tasks, removed RwLock writes from send/recv hot paths
Increased send_ack_timeout (1s → 30s) for large chunk transfers over NAT-traversed connections

Test plan

Tested on 6-node testnet (5 cloud + 1 local behind home router NAT)
Hole-punch succeeds end-to-end (client → coordinator → NATed node → NAT binding)
Bidirectional communication through NAT (DHT, quotes, chunk transfers)
Chunks stored on NATed local node (14MB confirmed)
Repeated hole-punches work across client sessions

🤖 Generated with Claude Code

Greptile Summary

This PR wires NAT traversal (PUNCH_ME_NOW coordination, hole-punching, unified accept loop) into the actual connection path. The architectural changes are well-motivated — removing the competing accept task, adding parallel handshakes, dial deduplication, and a background session driver all address real reliability problems. Most of the individual pieces are correctly implemented.

However, one critical flaw blocks the core relay path from working:

DefaultHasher cross-process non-determinism (src/endpoint.rs:352, src/nat_traversal_api.rs:2152): Both the requester and the coordinator use DefaultHasher to convert a SocketAddr to a 32-byte wire ID. Because DefaultHasher is randomly seeded per-process, the same address produces different bytes on two different machines. The coordinator's RelayPunchMeNow handler iterates connections and computes wire_id_from_addr(conn.remote) to find the target — this hash will never match the value the requester computed. The fix is to encode the address bytes directly or use a fixed-seed hash.

Additional issues:

Stale connection reaper disabled (src/nat_traversal_api.rs:3669): is_connected() no longer checks is_alive(), and poll_closed_connections no longer removes entries from the DashMap. The reaper in spawn_stale_connection_reaper uses !is_connected() to find dead peers — it will never return stale connections, so connections registered without a reader task (via spawn_incoming_connection_forwarder or the lazy send() path) leak indefinitely.
Debug error! logs left in (src/p2p_endpoint.rs:2812, 2819): FORWARDER_DEBUG: messages at error! level will create false-positive alerts in production log monitoring systems.
Silent data loss (src/p2p_endpoint.rs:2475): try_send in the reader task drops data when the channel is full, breaking the application-level reliability guarantee for chunk transfers.

Confidence Score: 2/5

Not safe to merge — the relay lookup is fundamentally broken due to non-deterministic cross-process hashing, meaning NAT traversal will silently fail in the most common case (requester ≠ coordinator).
The architectural direction is sound and well-tested on the author's testnet, but the DefaultHasher cross-process non-determinism in wire_id_from_addr breaks the coordinator relay lookup entirely. Every relay attempt will fail silently, which is the primary new code path this PR introduces. The stale-reaper regression and debug-level error! logs compound the concern. One targeted fix (replace DefaultHasher with a stable encoding in both src/endpoint.rs and src/nat_traversal_api.rs) would resolve the blocking issue.
src/endpoint.rs (wire_id_from_addr), src/nat_traversal_api.rs (wire_id_from_addr + is_connected), src/p2p_endpoint.rs (debug logs + try_send).

Important Files Changed

Filename	Overview
src/endpoint.rs	Adds `wire_id_from_addr` using `DefaultHasher` for relay peer lookup — non-deterministic across processes means the coordinator can never find the target peer, breaking NAT traversal relay entirely.
src/nat_traversal_api.rs	Adds unified accept loop, hole-punch tracking channels, and `accept_connection_direct`; `is_connected` no longer checks liveness causing the stale reaper to never trigger; also contains a second `DefaultHasher`-based `wire_id_from_addr` that must also be fixed.
src/p2p_endpoint.rs	Adds dial deduplication, session driver, incoming connection forwarder, and post-hole-punch direct-connect retry; `FORWARDER_DEBUG` error-level logs left in; reader task `try_send` silently drops data; forwarder registers connections without spawning reader tasks (no cleanup path).
src/connection/nat_traversal.rs	Adds stale coordination reset and bootstraps a new coordination round when receiving a relayed PUNCH_ME_NOW with no active round; logic is correct, though the `should_reset` conditional is slightly redundant.
src/connection/mod.rs	Fixes coordinator-only path (requires `target_peer_id` to relay), replaces old `start_coordination_round` with `InitiateHolePunch` endpoint event, adds logging. Changes are targeted and correct.
src/high_level/connection.rs	Adds `conn.wake()` after queuing NAT frames so the QUIC driver flushes them promptly; caches `remote_address` to avoid hot-path mutex acquisition; uses `try_lock` for `observed_address`. All changes are sound.
src/high_level/endpoint.rs	Adds `hole_punch_tx` channel and `default_client_config` to State; processes hole-punch addresses in the driver loop with proper fallback. Clean, well-structured change.
src/high_level/mutex.rs	Adds `try_lock` to both tracking and non-tracking mutex variants — straightforward and correct.
src/config/nat_timeouts.rs	Increases `DEFAULT_SEND_ACK_TIMEOUT` from 1s to 30s and `FAST_SEND_ACK_TIMEOUT` from 500ms to 5s to accommodate slow QUIC congestion-window ramp-up on NAT connections; well-justified.
src/frame/nat_traversal_unified.rs	Adds optional `target_peer_id` encoding/decoding to `PunchMeNow` frame (1-byte presence flag + 32-byte payload); backward-compatible since `None` encodes as a single `0x00` byte and legacy decoders would error on trailing bytes.
src/link_transport_impl.rs	Switches `dial_addr` to `connect_with_fallback` for NAT traversal support; correctly uses the actual connected address for subsequent lookups. Reasonable change.
src/shared.rs	Adds `InitiateHolePunch` variant and a sender address field to `RelayPunchMeNow`. Clean extension to the event enum.

Sequence Diagram

sequenceDiagram
    participant A as Peer A (requester)
    participant C as Coordinator C
    participant B as Peer B (NATed target)

    Note over A: connect_with_fallback_inner()
    A->>C: QUIC connect (already established)
    Note over A: send_coordination_request()<br/>target_wire_id = wire_id_from_addr(B_addr)<br/>⚠️ uses DefaultHasher (process-local seed)
    A->>C: PunchMeNow { target_peer_id=wire_id, address=A_addr }

    Note over C: handle_endpoint_event()<br/>RelayPunchMeNow handler
    Note over C: for each conn: wire_id_from_addr(conn.remote)<br/>⚠️ different DefaultHasher seed → never matches
    C--xB: relay FAILS (wire_id mismatch)

    Note over B: Never receives coordination<br/>Never sends QUIC Initial to A<br/>NAT traversal silently fails

    alt Happy path (if wire_id fixed)
        C->>B: relayed PunchMeNow { target_peer_id=None, address=A_addr }
        Note over B: handle_punch_me_now()<br/>Emits InitiateHolePunch event
        B->>A: QUIC Initial (creates NAT binding)
        A->>B: QUIC Initial (simultaneous-open)
        Note over B: spawn_accept_loop() accepts connection<br/>handshake_tx ← (B_addr, conn)
        Note over A: accept_connection_direct() returns<br/>P2pEndpoint::accept() registers peer<br/>spawn_reader_task()
        A->>B: Data (streams open bidirectionally)
    end

Comments Outside Diff (1)

src/nat_traversal_api.rs, line 3664-3676 (link)

Stale connection reaper is permanently disabled by this change

is_connected() now returns true for any address present in the DashMap (line 3675), regardless of whether the underlying QUIC connection is alive or closed. Combined with the fact that poll_closed_connections no longer removes entries from the DashMap (it now only emits events), the DashMap entries for dead connections are never removed.

The stale connection reaper in P2pEndpoint::spawn_stale_connection_reaper identifies dead connections using:

.filter(|addr| !inner.is_connected(addr))

Since is_connected always returns true for any key in the DashMap, and keys are never removed, !inner.is_connected(addr) is always false. The reaper never fires.

The primary cleanup path (reader task → reader_exit_tx → do_cleanup_connection) still works when a reader task exists. However, connections registered through:

spawn_incoming_connection_forwarder (no reader task spawned)
The lazy registration branch inside send() (lines ~2087–2128, no reader task spawned)

…have no cleanup path at all. These connections will accumulate in connected_peers as permanent zombies, and subsequent send() calls to those addresses will fail with open_uni errors indefinitely.

At minimum, is_connected should still check conn.is_alive() so the reaper can detect connections that have entered a closed/draining state, or a new explicit cleanup must be triggered from the registrations that skip the reader-task path.
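The effect on the reaper can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `Conn`, `is_connected_by_key`, `is_connected_alive`, and `stale_addrs` are hypothetical stand-ins, and a plain `HashMap` replaces the DashMap.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical stand-in for a tracked QUIC connection.
struct Conn {
    alive: bool,
}

/// Membership-only check: true for every tracked address, dead or alive.
fn is_connected_by_key(conns: &HashMap<SocketAddr, Conn>, addr: &SocketAddr) -> bool {
    conns.contains_key(addr)
}

/// Liveness-aware check: true only when the tracked connection is still alive.
fn is_connected_alive(conns: &HashMap<SocketAddr, Conn>, addr: &SocketAddr) -> bool {
    conns.get(addr).map_or(false, |conn| conn.alive)
}

/// Addresses a reaper would select, given a connectivity predicate.
fn stale_addrs(
    conns: &HashMap<SocketAddr, Conn>,
    is_connected: fn(&HashMap<SocketAddr, Conn>, &SocketAddr) -> bool,
) -> Vec<SocketAddr> {
    conns
        .keys()
        .filter(|addr| !is_connected(conns, *addr))
        .copied()
        .collect()
}
```

With the membership-only predicate the filter can never select anything, because it only iterates keys that are present. With the liveness-aware predicate it selects exactly the dead entries.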

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/endpoint.rs
Line: 349-362

Comment:
**`DefaultHasher` is non-deterministic across processes — relay lookup always fails**

`DefaultHasher::new()` uses a randomly seeded internal state that is initialized once per process. Two different processes (the requesting peer computing `target_peer_id` via `NatTraversalEndpoint::wire_id_from_addr`, and the coordinator calling this function to match connections) will produce completely different hashes for the same `SocketAddr`.

The PUNCH_ME_NOW relay path is:
1. Requester (Peer A): calls `NatTraversalEndpoint::wire_id_from_addr(target_addr)` → sends as `target_peer_id` in frame to Coordinator C
2. Coordinator (Peer C): receives frame, calls `endpoint.rs::wire_id_from_addr(conn.remote)` for each connection to find a match

Since A and C are different processes, their `DefaultHasher` seeds differ. The comparison at line 701 (`wire_id == target_peer_id`) will **never** succeed, meaning the relay lookup will always silently fail and NAT traversal coordination will not work.

Both this function and `NatTraversalEndpoint::wire_id_from_addr` must be replaced with a deterministic hash that is stable across processes. Use a fixed-seed algorithm such as SipHash-1-3 with a hard-coded key, or simply encode the raw IP bytes directly into the 32-byte array without hashing:

```rust
fn wire_id_from_addr(addr: SocketAddr) -> [u8; 32] {
    let mut bytes = [0u8; 32];
    match addr {
        SocketAddr::V4(v4) => {
            bytes[0] = 4;
            bytes[1..5].copy_from_slice(&v4.ip().octets());
            bytes[5..7].copy_from_slice(&v4.port().to_be_bytes());
        }
        SocketAddr::V6(v6) => {
            bytes[0] = 6;
            bytes[1..17].copy_from_slice(&v6.ip().octets());
            bytes[17..19].copy_from_slice(&v6.port().to_be_bytes());
        }
    }
    bytes
}
```

The same fix must be applied to `NatTraversalEndpoint::wire_id_from_addr` in `src/nat_traversal_api.rs:2152`.

---

This is a comment left during a code review.
Path: src/p2p_endpoint.rs
Line: 2812-2819

Comment:
**Debug `error!` level logs left in production code**

Two `tracing::error!` calls tagged `FORWARDER_DEBUG:` are left in this shipping code. Using `error!` for debug traces will flood operator log sinks with false-positive alerts (PagerDuty, Datadog, etc.) on every `P2pEndpoint` construction, which happens at startup for every node.

```suggestion
        debug!("spawn_incoming_connection_forwarder called");
```

And on line 2819:
```suggestion
            debug!("Incoming connection forwarder: started, acquiring rx lock...");
```

---

This is a comment left during a code review.
Path: src/p2p_endpoint.rs
Line: 2475-2482

Comment:
**`try_send` silently drops data under backpressure**

The switch from `send().await` to `try_send` trades one failure mode (deadlock) for another (silent data loss). When the bounded data channel is full, `TrySendError::Full` is logged and the received bytes are discarded. For the chunk-transfer use case described in the PR (14 MB chunks over NAT), a momentary burst of incoming streams can easily saturate the channel, causing the application layer to silently miss messages.

QUIC guarantees ordered, reliable delivery at the transport layer — receiving a partial sequence of application messages breaks that guarantee from the caller's perspective.

A safer alternative is to spawn a short-lived task per message that blocks on `send().await`, capped by a per-peer semaphore so the number of in-flight tasks is bounded without dropping data:

```rust
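// Hand the message to a short-lived task that awaits channel capacity, so the
// reader loop itself never blocks on a full channel. This snippet omits the
// per-peer semaphore described above, and separately spawned tasks do not by
// themselves guarantee that messages are delivered in arrival order.
let tx = data_tx.clone();
tokio::spawn(async move {
    // A send error only means the receiver was dropped; nothing to recover.
    let _ = tx.send((addr, data)).await;
});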
```

If dropping is intentional (UDP-like semantics), the log level should be `warn!` and the caller must be documented as best-effort.

Reviews (1): Last reviewed commit: "fix: wire NAT traversal into connection ..."

Greptile also left 3 inline comments on this PR.

mickvandijke

Deep Review: PR #25 — Wire NAT Traversal into Connection Path

+1070 / -212 across 13 files. The architectural direction is sound and the testnet validation is encouraging, but several issues need addressing before merge.

Note on Greptile's DefaultHasher claim: Greptile rated this 2/5 primarily because DefaultHasher is "non-deterministic across processes." That specific claim is a false positive: DefaultHasher::new() uses SipHash with fixed keys and is deterministic today (the per-process random seeding belongs to RandomState, not to DefaultHasher::new()). The relay path works. However, it's fragile (see High #4 below).
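The distinction can be checked with the standard library alone. In this minimal sketch the helper names are hypothetical. The standard library documents that hashers created through `DefaultHasher::new()` agree with one another, and also that the algorithm is unspecified and should not be relied on across releases. A single process cannot demonstrate cross-process behaviour, so this only illustrates where the random keys come from.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};
use std::net::SocketAddr;

/// Hash an address with a freshly constructed `DefaultHasher`.
/// Every hasher made by `DefaultHasher::new()` starts from the same state.
fn hash_with_default_hasher(addr: &SocketAddr) -> u64 {
    let mut hasher = DefaultHasher::new();
    addr.hash(&mut hasher);
    hasher.finish()
}

/// Hash an address with a hasher built from a new `RandomState`.
/// `RandomState::new()` picks random keys, so results vary between instances.
fn hash_with_random_state(addr: &SocketAddr) -> u64 {
    let mut hasher = RandomState::new().build_hasher();
    addr.hash(&mut hasher);
    hasher.finish()
}
```

Repeated calls to `hash_with_default_hasher` on the same address return the same value, while `hash_with_random_state` is the variant whose output is expected to differ from call to call.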

CRITICAL

1. `send_ack_timeout` increase from 1s → 30s masks connection failures

Files: src/config/nat_timeouts.rs:141,144

DEFAULT_SEND_ACK_TIMEOUT went from 1s to 30s (30× increase), FAST_SEND_ACK_TIMEOUT from 500ms to 5s (10×). While the reasoning for large NAT chunk transfers is sound, this is a global default that affects all connections — not just NAT-traversed ones. Small control messages will now wait 30s before detecting a dead connection. The doc on TimeoutConfig::send_ack_timeout says "this must be shorter than any outer send timeout" — 30s likely exceeds many callers' timeouts, causing cascading failures.

Fix: Make the timeout adaptive (base + per-byte rate), or expose separate timeouts for control vs. bulk transfers, or scope the increase to NAT-traversed connections only.
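One way to read the adaptive option is a configured floor plus a size-based term. The sketch below uses the formula quoted in the author's reply later in this thread, `max(config_timeout, data.len() / 100_000 seconds)`. The function name, the constant name, and the integer-division rounding are assumptions.

```rust
use std::time::Duration;

/// Assumed throughput floor, taken from the `data.len() / 100_000` term.
const MIN_BYTES_PER_SEC: u64 = 100_000;

/// Return the larger of the configured timeout and a payload-size-based timeout.
fn adaptive_send_ack_timeout(config_timeout: Duration, payload_len: usize) -> Duration {
    // Whole seconds only; any fractional part is truncated.
    let size_based = Duration::from_secs(payload_len as u64 / MIN_BYTES_PER_SEC);
    config_timeout.max(size_based)
}
```

Under these assumptions a 1 KiB control message keeps a 5s configured timeout, while a 14,000,000-byte payload is allowed 140s.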

2. Stale Connection Reaper Permanently Disabled — Unbounded DashMap Growth

Files: src/nat_traversal_api.rs:3669-3676, src/p2p_endpoint.rs:2733-2740

is_connected() was changed from checking conn.is_alive() (with dead-connection removal) to a bare self.connections.contains_key(addr). Meanwhile, poll_closed_connections() no longer removes dead connections from the DashMap. The stale connection reaper in P2pEndpoint filters on !inner.is_connected(addr) — which now always returns empty.

Impact: Dead connections accumulate forever. broadcast_address_to_peers() sends to dead connections. check_connections_for_observed_addresses scans dead entries every 500ms. The docstring still claims "removes it from the connection table and returns false" — stale and misleading.

Fix: Restore is_alive() in is_connected(), or have poll_closed_connections remove entries after emitting the event (with a grace period for hole-punched connections).

3. `event_rx` Has Competing Consumers — Events Silently Lost

Files: src/nat_traversal_api.rs:3579-3594, 4851-4855

Three independent code paths consume from the same mpsc::UnboundedReceiver:

spawn_accept_loop — drains ConnectionEstablished events
drain_pending_events (called from poll()) — drains all events
accept_connection (old method, still called from connection_router.rs:1299)

Events consumed by one reader are lost to the others. Additionally, spawn_accept_loop uses while let Ok(NatTraversalEvent::ConnectionEstablished { .. }) = erx.try_recv() — a refutable pattern that silently drops any non-ConnectionEstablished event it dequeues.

Fix: Use a broadcast channel, have a single drain point that routes events, or remove the old accept_connection() path and update ConnectionRouter.
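The refutable-pattern problem is easy to reproduce. This minimal sketch uses `std::sync::mpsc` as a stand-in for the unbounded channel in the PR; the `Event` enum and both function names are hypothetical.

```rust
use std::sync::mpsc::Receiver;

#[derive(Debug, PartialEq)]
enum Event {
    ConnectionEstablished(u32),
    ConnectionLost(u32),
}

/// Lossy drain, mirroring the `while let Ok(Pattern) = try_recv()` shape.
/// The first event that is not `ConnectionEstablished` is dequeued, fails the
/// pattern, and is dropped when the loop exits.
fn drain_established_lossy(rx: &Receiver<Event>) -> Vec<u32> {
    let mut ids = Vec::new();
    while let Ok(Event::ConnectionEstablished(id)) = rx.try_recv() {
        ids.push(id);
    }
    ids
}

/// Single drain point: every dequeued event is routed to some consumer.
fn drain_and_route(rx: &Receiver<Event>) -> (Vec<u32>, Vec<Event>) {
    let mut established = Vec::new();
    let mut other = Vec::new();
    while let Ok(event) = rx.try_recv() {
        match event {
            Event::ConnectionEstablished(id) => established.push(id),
            unhandled => other.push(unhandled),
        }
    }
    (established, other)
}
```

Given the sequence established, lost, established, the lossy drain returns only the first event and silently discards the `ConnectionLost`; the routing drain hands all three to their consumers.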

HIGH

4. `wire_id_from_addr` Uses `DefaultHasher` — Fragile Wire Protocol

Files: src/endpoint.rs:352-362, src/nat_traversal_api.rs:2152-2164

DefaultHasher::new() is deterministic today (fixed SipHash keys), but Rust does not guarantee stability across compiler versions. A mixed-version deployment would silently break relay. Additionally:

Two identical copies must stay in sync (only a doc comment enforces this)
32-byte output has only 64 bits of entropy (same u64 repeated 4×)

Fix: Extract into a shared function. Replace with BLAKE3 (already a dependency): blake3::hash(addr.to_string().as_bytes()).

MEDIUM

5. `try_send` in Reader Task Silently Drops Data

File: src/p2p_endpoint.rs:2475-2482

The switch from send().await to try_send means when the bounded data channel is full, received bytes are logged at warn! and discarded. QUIC guarantees reliable delivery at transport layer — silently dropping application messages breaks that contract. For 14MB chunk transfers over NAT, bursts can easily saturate the channel.

Fix: Spawn a short-lived task per message that blocks on send().await (bounded by a per-peer semaphore), increase channel capacity, or use tokio::select! with a timeout.
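A non-blocking `try_send` does not have to discard anything, because a full channel hands the message back to the caller. The sketch below shows a different technique from the three options listed above: a local backlog that retries in arrival order. It uses `std::sync::mpsc::sync_channel` as a stand-in for the async channel in the PR, and `Forwarder` is a hypothetical type. The backlog is unbounded here, so a real implementation would still need a cap or backpressure.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Buffers messages the channel could not take so they can be retried in order.
struct Forwarder<T> {
    tx: SyncSender<T>,
    backlog: VecDeque<T>,
}

impl<T> Forwarder<T> {
    fn new(tx: SyncSender<T>) -> Self {
        Self { tx, backlog: VecDeque::new() }
    }

    /// Queue `msg` behind any backlog, then push as much as the channel accepts.
    /// Returns false once the receiver has been dropped.
    fn forward(&mut self, msg: T) -> bool {
        self.backlog.push_back(msg);
        self.flush()
    }

    /// Send backlog entries oldest-first without blocking.
    fn flush(&mut self) -> bool {
        while let Some(next) = self.backlog.pop_front() {
            match self.tx.try_send(next) {
                Ok(()) => {}
                Err(TrySendError::Full(unsent)) => {
                    // Keep the message at the front so ordering is preserved.
                    self.backlog.push_front(unsent);
                    return true;
                }
                Err(TrySendError::Disconnected(_)) => return false,
            }
        }
        true
    }
}
```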

6. `cached_remote_addr` Stale After Connection Migration

Files: src/high_level/connection.rs:667-668, 1124-1126

remote_address() now returns a cached value set once at construction. After migration, DashMap lookups, data routing, and relay lookups all use a stale address.

Fix: Rename to initial_remote_address() and document, or update the cache on migration events.

7. Fire-and-Forget Fallback Creates Orphaned Connections

File: src/high_level/endpoint.rs:733-758

When hole_punch_tx is None, the fallback calls self.inner.connect(...) but discards both _ch and _conn. No driver is spawned, no cleanup exists — connection state leaks in the low-level Endpoint forever.

Fix: Properly register the connection, or don't create a full QUIC connection in the fallback path.

8. `decode_rfc` Corrupts Stream on Partial `target_peer_id`

File: src/frame/nat_traversal_unified.rs:369-380

When has_peer_id == 1 but r.remaining() < 32, the 1-byte flag is already consumed, leaving the stream position off by 1. Subsequent frame parsing will be corrupted. Non-zero non-one values are silently accepted.

Fix: Return Err(UnexpectedEnd) when fewer than 32 bytes remain. Validate has_peer_id is 0 or 1.
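The validation described in this fix can be sketched over a plain byte slice. This is a minimal illustration, not the PR's `decode_rfc`: `DecodeError`, `InvalidFlag`, and `decode_target_peer_id` are hypothetical names, and `&mut &[u8]` stands in for the `Buf` reader. The cursor is advanced only after every check has passed.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    UnexpectedEnd,
    InvalidFlag(u8),
}

/// Decode the optional `target_peer_id` field: a 1-byte presence flag followed
/// by 32 bytes when the flag is 1. `buf` is advanced only on success.
fn decode_target_peer_id(buf: &mut &[u8]) -> Result<Option<[u8; 32]>, DecodeError> {
    // Work on a copy of the slice so a failure leaves `buf` untouched.
    let input: &[u8] = *buf;
    let (&flag, rest) = input.split_first().ok_or(DecodeError::UnexpectedEnd)?;
    match flag {
        0 => {
            *buf = rest;
            Ok(None)
        }
        1 => {
            if rest.len() < 32 {
                return Err(DecodeError::UnexpectedEnd);
            }
            let (id_bytes, tail) = rest.split_at(32);
            let mut id = [0u8; 32];
            id.copy_from_slice(id_bytes);
            *buf = tail;
            Ok(Some(id))
        }
        other => Err(DecodeError::InvalidFlag(other)),
    }
}
```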

9. `observed_address()` Silently Returns `None` Under Lock Contention

File: src/high_level/connection.rs:681-688

try_lock returns None during contention, indistinguishable from "no OBSERVED_ADDRESS received." Address discovery may be delayed or miss observations entirely.

Fix: Cache observed address in an AtomicCell outside the connection mutex, or return a tri-state to distinguish contention from absence.
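The tri-state option might look like the following. This is a sketch under assumptions: `std::sync::Mutex` stands in for the crate's own mutex wrapper, and `Observed`, `ConnState`, and this free-function `observed_address` are hypothetical.

```rust
use std::net::SocketAddr;
use std::sync::{Mutex, TryLockError};

#[derive(Debug, PartialEq)]
enum Observed {
    /// An OBSERVED_ADDRESS value has been recorded.
    Known(SocketAddr),
    /// The lock was taken and no address has been recorded.
    Absent,
    /// The lock was busy; the caller should retry rather than assume absence.
    Contended,
}

/// Hypothetical connection state guarded by a mutex.
struct ConnState {
    observed: Option<SocketAddr>,
}

fn observed_address(state: &Mutex<ConnState>) -> Observed {
    match state.try_lock() {
        Ok(guard) => guard.observed.map_or(Observed::Absent, Observed::Known),
        // A poisoned lock still holds readable data.
        Err(TryLockError::Poisoned(poisoned)) => poisoned
            .into_inner()
            .observed
            .map_or(Observed::Absent, Observed::Known),
        Err(TryLockError::WouldBlock) => Observed::Contended,
    }
}
```

A caller that receives `Contended` knows to poll again, which a bare `None` cannot communicate.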

10. Debug `error!` Logs Left in Production Code

File: src/p2p_endpoint.rs:2812,2819

Two tracing::error!("FORWARDER_DEBUG: ...") calls at error level will trigger production alerts. Trivial fix — change to debug!.

jacderida · 2026-03-24T16:05:53Z

Thanks for the thorough review @mickvandijke. All 10 issues addressed in the latest force-push:

CRITICAL

1. send_ack_timeout increase — Restored defaults to 5s/2s. The send path now computes an adaptive timeout: max(config_timeout, data.len() / 100_000 seconds). Small control messages use the 5s default; large chunk transfers get proportionally more time.

2. Stale connection reaper — Fixed. is_connected() now checks close_reason() and removes dead connections. poll_closed_connections() removes dead connections after a 5-second grace period (tracked via a closed_at DashMap) to avoid racing with hole-punch setup.

3. event_rx competing consumers — Removed the event_rx drain from spawn_accept_loop entirely. The accept loop now relies on incoming_notify and scans the connections DashMap directly for newly-emitted addresses. Only poll() and accept_connection_direct() consume from event_rx, and accept_connection_direct only does a single try_recv per call.

HIGH

4. wire_id_from_addr — Replaced both copies with a deterministic byte encoding (version byte + raw IP octets + port, zero-padded to 32 bytes) in a shared function at src/shared.rs. No hashing involved. Both endpoint.rs and nat_traversal_api.rs delegate to crate::shared::wire_id_from_addr.

MEDIUM

5. try_send data loss — Instead of dropping data on a full channel, the reader task now spawns a bounded task that retries send().await with a 5-second timeout. Data is only dropped if the timeout expires.

6. cached_remote_addr — Renamed to initial_remote_addr. Added doc comment to remote_address() noting it returns the address at connection creation time and may not reflect connection migration.

7. Fire-and-forget fallback — Added comment explaining the intentional discard is for backward compatibility when hole_punch_tx is not configured. Quinn's internal idle timeout handles cleanup.

8. decode_rfc partial target_peer_id — Now returns Err(UnexpectedEnd) when has_peer_id == 1 but fewer than 32 bytes remain. Invalid flag values (not 0 or 1) are also rejected.

9. observed_address() try_lock — Added doc comment explaining that None may indicate lock contention rather than absence of observed address data.

10. Debug error! logs — Changed both FORWARDER_DEBUG calls to debug!().
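The byte encoding described in item 4 is reversible, which a hash is not. The following is a minimal sketch assuming the offsets from Greptile's suggested encoder earlier in this thread; the actual layout in src/shared.rs is not shown here, and `addr_from_wire_id` is a hypothetical helper. IPv6 flow info and scope ID are not encoded, so they do not survive the round trip.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Pack an address as: version byte, raw IP octets, big-endian port, zero padding.
fn wire_id_from_addr(addr: SocketAddr) -> [u8; 32] {
    let mut bytes = [0u8; 32];
    match addr {
        SocketAddr::V4(v4) => {
            bytes[0] = 4;
            bytes[1..5].copy_from_slice(&v4.ip().octets());
            bytes[5..7].copy_from_slice(&v4.port().to_be_bytes());
        }
        SocketAddr::V6(v6) => {
            bytes[0] = 6;
            bytes[1..17].copy_from_slice(&v6.ip().octets());
            bytes[17..19].copy_from_slice(&v6.port().to_be_bytes());
        }
    }
    bytes
}

/// Inverse of `wire_id_from_addr`; returns `None` for an unknown version byte.
fn addr_from_wire_id(id: &[u8; 32]) -> Option<SocketAddr> {
    match id[0] {
        4 => {
            let ip = Ipv4Addr::new(id[1], id[2], id[3], id[4]);
            let port = u16::from_be_bytes([id[5], id[6]]);
            Some(SocketAddr::new(IpAddr::V4(ip), port))
        }
        6 => {
            let mut octets = [0u8; 16];
            octets.copy_from_slice(&id[1..17]);
            let port = u16::from_be_bytes([id[17], id[18]]);
            Some(SocketAddr::new(IpAddr::V6(Ipv6Addr::from(octets)), port))
        }
        _ => None,
    }
}
```

Because the output depends only on the address bytes, two peers on different machines or compiler versions compute the same ID for the same address.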

Wire the existing NAT traversal protocol (PUNCH_ME_NOW coordination, hole-punching) into the actual connection path so nodes behind NAT can participate in the network.

Key changes:
- Unified accept path: remove competing accept_connections background task, add accept_connection_direct() as sole accept path for both Quinn incoming and outgoing hole-punch connections
- Tracked hole-punch connections: forward addresses from Quinn driver via channel to NatTraversalEndpoint for full connection registration (DashMap, events, reader tasks) instead of fire-and-forget
- Background accept loop with parallel handshakes to prevent serialized blocking on slow NAT connections
- Stale coordination reset so repeated hole-punches work across client sessions
- Dial deduplication to prevent concurrent hole-punch attempts to the same target from deadlocking the runtime
- Contention reduction: cached remote_address, try_lock for observed_address, try_send in reader tasks, removed RwLock writes from send/recv hot paths
- Increased send_ack_timeout (1s to 30s) for large chunk transfers over NAT-traversed connections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mickvandijke

Deep Review — Critical & High Issues

Greptile Triage

Greptile's 4 findings (DefaultHasher non-determinism, stale reaper disabled, error! logs, try_send data loss) were all addressed before the PR was pushed. The Greptile review appears to have run against an earlier commit. The wire_id_from_addr in shared.rs:261 now uses deterministic byte packing, is_connected() checks close_reason(), and the debug logs use debug! level.

CRITICAL

C1. `decode_auto` fallback corrupts buffer after partial RFC decode

src/frame/nat_traversal_unified.rs:393-401

decode_auto tries decode_rfc first. If decode_rfc fails partway through (after consuming round + seq + address bytes), the Buf cursor has already advanced. decode_legacy then starts from the wrong position, interpreting leftover bytes as a new frame.

A malicious peer can craft a frame with a valid RFC prefix but has_peer_id = 2 (line 381) to trigger this path, causing decode_legacy to parse attacker-controlled data as a PunchMeNow frame. This is exploitable in a P2P protocol where any peer can send arbitrary frames.

Fix: Either save/restore the buffer position before attempting RFC decode, or don't fall back to legacy after partial RFC consumption.
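The save/restore option can be sketched with a copied slice cursor. Everything here is a toy: the two-byte "RFC" format, the one-byte "legacy" format, `Frame`, and `take_byte` are hypothetical and far simpler than the real frames. Only the commit-on-success pattern is the point, and whether falling back to legacy after an invalid flag is desirable at all is a separate question the fix above leaves open.

```rust
#[derive(Debug, PartialEq)]
enum Frame {
    Rfc { round: u8, has_peer_id: bool },
    Legacy(u8),
}

/// Consume one byte from the cursor, if any remain.
fn take_byte(buf: &mut &[u8]) -> Option<u8> {
    let input: &[u8] = *buf;
    let (&byte, rest) = input.split_first()?;
    *buf = rest;
    Some(byte)
}

/// Toy RFC decoder: round byte, then a flag that must be 0 or 1.
/// On an invalid flag the cursor has already advanced by two bytes.
fn decode_rfc(buf: &mut &[u8]) -> Option<Frame> {
    let round = take_byte(buf)?;
    let flag = take_byte(buf)?;
    match flag {
        0 => Some(Frame::Rfc { round, has_peer_id: false }),
        1 => Some(Frame::Rfc { round, has_peer_id: true }),
        _ => None,
    }
}

/// Toy legacy decoder: a single byte.
fn decode_legacy(buf: &mut &[u8]) -> Option<Frame> {
    take_byte(buf).map(Frame::Legacy)
}

/// Try RFC on a copy of the cursor and commit the position only on success.
fn decode_auto(buf: &mut &[u8]) -> Option<Frame> {
    let mut attempt: &[u8] = *buf;
    if let Some(frame) = decode_rfc(&mut attempt) {
        *buf = attempt;
        return Some(frame);
    }
    // `buf` is untouched, so the legacy decoder starts at the frame boundary.
    decode_legacy(buf)
}
```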

C2. `spawn_incoming_connection_forwarder` registers connections without reader tasks

src/p2p_endpoint.rs:2857-2878

Registers peers in connected_peers (line 2873) and emits PeerConnected (line 2874) but never spawns a reader task. The forwarder only receives a SocketAddr from the channel — it has no Connection handle. Consequences:

recv() never delivers data for these connections
Event-driven cleanup (reader-exit handler) never fires
Only the stale reaper can clean up, on a 10s interval

This same class of issue affects two more code paths:

try_hole_punch (line 1858-1869) — registers in connected_peers without reader task
send() lazy registration (lines ~2100-2133) — same problem

All three paths create "zombie" entries that appear connected but silently drop inbound data.

C3. Competing `accept()` consumers — race condition

src/nat_traversal_api.rs — lines 1542 vs 2716

Both spawn_accept_loop() (spawned unconditionally in new()) and accept_connections() (spawnable via start_listening()) call endpoint.accept() on the same InnerEndpoint. If both run, incoming connections are non-deterministically split between two code paths with different registration logic. The shared emitted_established_events DashSet further complicates this — if one path inserts an address first, the other path's event emission is suppressed.

Fix: Either remove accept_connections/start_listening entirely (since spawn_accept_loop replaces it), or guard against double-spawning with an AtomicBool.
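The `AtomicBool` guard could be as small as the sketch below. `AcceptGuard` and `try_claim` are hypothetical names; the idea is that both `spawn_accept_loop()` and `accept_connections()` would call `try_claim` and only the winner would start calling `endpoint.accept()`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Ensures at most one accept loop is ever started for an endpoint.
struct AcceptGuard {
    started: AtomicBool,
}

impl AcceptGuard {
    const fn new() -> Self {
        Self { started: AtomicBool::new(false) }
    }

    /// Returns true for exactly one caller, even under concurrent calls.
    fn try_claim(&self) -> bool {
        self.started
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```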

HIGH

H1. Dial dedup key only uses first non-None address

src/p2p_endpoint.rs:1326

let target = target_ipv4.or(target_ipv6);

If caller A dials (Some(ipv4), None) and caller B dials (None, Some(ipv6)) to the same peer, they won't be deduplicated — both start parallel hole-punch sessions, which is exactly what dedup was designed to prevent.

H2. Thundering-herd retry after failed primary dial

src/p2p_endpoint.rs:1346-1349, 1366

When the primary dial fails, the pending_dials entry is removed (line 1366) and all waiters receive the error. They all fall through to retry simultaneously with no guard — re-creating the concurrent dial storm that dedup was meant to prevent.

Fix: Either re-insert the pending_dials entry so only one waiter retries, or add exponential backoff with jitter for waiters.

H3. `ConnectionMethod::HolePunched` incorrect for dedup waiters

src/p2p_endpoint.rs:1341-1343

Waiters always report ConnectionMethod::HolePunched { coordinator: target_addr } regardless of how the primary actually connected (could be DirectIPv4, DirectIPv6, or Relay). This gives callers incorrect telemetry about connection establishment.

Fix: Broadcast the actual ConnectionMethod along with the connection result.

H4. `poll_closed_connections` emits `ConnectionLost` on every poll tick

src/nat_traversal_api.rs:4866-4892

During the 5-second grace period before removal, ConnectionLost is emitted on every poll tick, not just once. This spams consumers with duplicate events.

Fix: Only emit on the first observation (when the closed_at entry is newly inserted), or track "already-emitted-lost" separately.

H5. `remove_connection` doesn't clean up `closed_at` DashMap

src/nat_traversal_api.rs:3777-3787

If a peer reconnects and disconnects again, the stale closed_at timestamp persists from the first disconnection. The new dead connection gets reaped immediately (if the old timestamp was >5s ago) instead of getting a fresh grace period.

Fix: Add self.closed_at.remove(addr) to remove_connection().
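A stripped-down model of the effect, with std collections standing in for the DashMaps. `ConnectionTable` and `mark_closed` are hypothetical names used only to show why the extra `remove` matters.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::time::Instant;

/// Hypothetical model of the connection bookkeeping; std maps stand in
/// for the DashMaps used in the PR.
#[derive(Default)]
pub struct ConnectionTable {
    connections: HashSet<SocketAddr>,
    closed_at: HashMap<SocketAddr, Instant>,
}

impl ConnectionTable {
    pub fn add_connection(&mut self, addr: SocketAddr) {
        self.connections.insert(addr);
    }

    /// Records the first time `addr` was seen closed and returns the
    /// timestamp the grace period will be measured from.
    pub fn mark_closed(&mut self, addr: SocketAddr, now: Instant) -> Instant {
        *self.closed_at.entry(addr).or_insert(now)
    }

    pub fn remove_connection(&mut self, addr: &SocketAddr) {
        self.connections.remove(addr);
        // The step the review asks for: drop the old timestamp so a
        // later disconnect starts a fresh grace period.
        self.closed_at.remove(addr);
    }
}
```

With the `closed_at.remove` line in place, a second disconnect is timed from its own observation rather than from the first one.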

H6. Fire-and-forget connection leak

src/high_level/endpoint.rs:733-761

When hole_punch_tx is not configured, the fallback creates a QUIC connection and discards both handles (_ch, _conn), so cleanup relies on the idle timeout (typically 30s). Under rapid InitiateHolePunch events (attack or busy network), zombie connections accumulate unboundedly.

Fix: Set a short idle timeout on fire-and-forget connections, or track them for explicit cleanup.
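The tracking option might look roughly like this. It is a sketch under assumptions: `PendingPunches` is a hypothetical name, the connection handle is left generic, and how an expired handle is actually closed is up to the caller.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for fallback connections that would otherwise be
/// dropped on the floor. `C` is whatever handle the caller keeps.
pub struct PendingPunches<C> {
    max_age: Duration,
    entries: Vec<(Instant, C)>,
}

impl<C> PendingPunches<C> {
    pub fn new(max_age: Duration) -> Self {
        Self { max_age, entries: Vec::new() }
    }

    /// Remember a fallback connection together with its creation time.
    pub fn track(&mut self, conn: C, now: Instant) {
        self.entries.push((now, conn));
    }

    /// Removes and returns every connection older than `max_age`, so the
    /// caller can close them explicitly instead of waiting for idle timeout.
    pub fn sweep(&mut self, now: Instant) -> Vec<C> {
        let max_age = self.max_age;
        let mut expired = Vec::new();
        let mut kept = Vec::new();
        for (created, conn) in self.entries.drain(..) {
            if now.duration_since(created) >= max_age {
                expired.push(conn);
            } else {
                kept.push((created, conn));
            }
        }
        self.entries = kept;
        expired
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

A periodic task would call `sweep` and close whatever it returns, which bounds how long a zombie connection can live.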

H7. Coordinator selection can pick the target itself

src/p2p_endpoint.rs:1395

config.known_peers.first() is used without filtering out the target address. If known_peers[0] happens to be the peer we're trying to reach, it becomes its own NAT traversal coordinator — which is nonsensical and will fail silently.

The connected_peers fallback path (line 1404) correctly filters the target, but the known_peers path does not.

Fix: Filter known_peers against the target address before selecting a coordinator.
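A minimal sketch of that filter. `pick_coordinator` is a hypothetical helper; it takes all known target addresses so that both the IPv4 and IPv6 forms of the target are excluded.

```rust
use std::net::SocketAddr;

/// Hypothetical helper: first known peer that is not one of the target's
/// own addresses, or None if no other peer is known.
pub fn pick_coordinator(
    known_peers: &[SocketAddr],
    target_addrs: &[SocketAddr],
) -> Option<SocketAddr> {
    known_peers
        .iter()
        .copied()
        .find(|peer| !target_addrs.contains(peer))
}
```

Returning `None` when only the target is known lets the caller fall back to the connected_peers path or report a clear error instead of failing silently.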

H8. `bootstrap_nodes` grows unboundedly

src/nat_traversal_api.rs:3711-3727

Every add_connection() pushes a BootstrapNode if the address isn't already present. There is no cap and no eviction of stale entries. Over a long-running node's lifetime with many transient peers, this is a slow memory leak.
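No fix is proposed for this finding. One possible mitigation, sketched here purely as an assumption, is a capacity cap with oldest-first eviction; `BoundedBootstrap` is a hypothetical type and plain addresses stand in for `BootstrapNode`.

```rust
use std::collections::VecDeque;
use std::net::SocketAddr;

/// Hypothetical capped list; SocketAddr stands in for BootstrapNode.
pub struct BoundedBootstrap {
    cap: usize,
    nodes: VecDeque<SocketAddr>,
}

impl BoundedBootstrap {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be non-zero");
        Self { cap, nodes: VecDeque::new() }
    }

    /// Adds `addr` if it is not already present. When the list is full,
    /// the oldest entry is evicted and returned.
    pub fn add(&mut self, addr: SocketAddr) -> Option<SocketAddr> {
        if self.nodes.contains(&addr) {
            return None;
        }
        let evicted = if self.nodes.len() >= self.cap {
            self.nodes.pop_front()
        } else {
            None
        };
        self.nodes.push_back(addr);
        evicted
    }

    pub fn len(&self) -> usize {
        self.nodes.len()
    }
}
```

Oldest-first eviction is the simplest policy; evicting by last-seen time or by failed-contact count would be alternatives if stale entries matter more than old ones.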

Note: Greptile's 4 original findings were all resolved in the current commit. This review covers issues not flagged by Greptile.

jacderida · 2026-03-25T19:09:17Z

This work will be explored as part of another branch to achieve continuous uploads.

jacderida mentioned this pull request Mar 24, 2026

fix: register unknown channels on the fly for hole-punched connections WithAutonomi/saorsa-core#58

Closed

2 tasks

greptile-apps Bot reviewed Mar 24, 2026

Comment thread src/endpoint.rs

Comment thread src/p2p_endpoint.rs Outdated

Comment thread src/p2p_endpoint.rs

jacderida force-pushed the feat-nat_traversal_attempts branch from ee6cf6f to 688c9ca Compare March 24, 2026 01:05

mickvandijke requested changes Mar 24, 2026

jacderida force-pushed the feat-nat_traversal_attempts branch 2 times, most recently from 211e396 to 758f16c Compare March 24, 2026 16:03

jacderida force-pushed the feat-nat_traversal_attempts branch 2 times, most recently from 82d872f to 56fdb2b Compare March 24, 2026 16:20

jacderida force-pushed the feat-nat_traversal_attempts branch from 56fdb2b to ebe824b Compare March 24, 2026 16:28

mickvandijke requested changes Mar 25, 2026

jacderida closed this Mar 25, 2026

dirvine mentioned this pull request Mar 29, 2026

cleanup: remove misleading forward_datagram stub in relay_server.rs #32

Open

jacderida wants to merge 1 commit into WithAutonomi:main from jacderida:feat-nat_traversal_attempts
