feat(reachability): gate relay publication with canary quorum#136

Open
mickvandijke wants to merge 6 commits into WithAutonomi:rc-2026.6.2 from mickvandijke:feat/relay-canary-gate-rc-2026.6.2

Conversation

@mickvandijke mickvandijke commented Jun 17, 2026

Collaborator

Summary

Adds a canary-gated relay publication flow for MASQUE relay acquisition. A newly acquired relay address is now verified by randomized independent non-close witnesses before it is published into the DHT self-record, keeping the DHT as a peer phonebook while ensuring relay reachability is proven before gossip.

Changes

  • Added reachability::canary, an internal versioned request/response protocol (relay-canary-v1) that asks selected witnesses to cold-dial a freshly allocated relay address and verify the authenticated target identity.
  • Selects randomized non-close witnesses, excluding the target, relayer, target close group, duplicate source IPs, and peers sharing the relay IP.
  • Wired relay canary requests through DhtNetworkManager using the generic transport request/response envelope rather than a DHT operation.
  • Added per-source canary in-flight limiting so one peer cannot fan out concurrent cold relay dials through a witness.
  • Updated the acquisition driver to publish relay addresses only after canary quorum succeeds.
  • Kept canary-rejected relayers excluded across ordinary acquisition failures; the exclusion set is cleared only after a verified relay or insufficient witness coverage.
  • Ignored transport ADD_ADDRESS relay hints in the DHT bridge so unverified relay allocations cannot bypass the sequenced self-record path.
  • Added request cleanup via an active request guard and made request/response timeouts surface as P2PError::Timeout.
  • Bounded witness-side relay cold dials so unreachable relays return DialFailed before the handler timeout, and classified canary request timeouts as failed witness attempts rather than ineligible witnesses.
  • Added driver-level tests for rejected-relayer retention and structured logs for witness availability, ineligible responses, and routing-table coverage.
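
The witness exclusion rules in the list above can be sketched as a single filter pass. This is a minimal illustration, not the code in `reachability::canary`: the `Candidate` struct, the `u64` peer IDs, and the `select_witnesses` function are hypothetical names, and the caller is assumed to have shuffled the candidate list beforehand so the result is randomized.

```rust
use std::collections::HashSet;
use std::net::IpAddr;

/// Hypothetical candidate record; field names are illustrative only.
#[derive(Clone, Debug)]
struct Candidate {
    peer_id: u64,
    source_ip: IpAddr,
}

/// Picks up to `wanted` witnesses from `candidates`, skipping the target,
/// the relayer, the target's close group, peers that share the relay IP,
/// and any peer whose source IP was already picked.
/// Assumption: `candidates` is already shuffled by the caller.
fn select_witnesses(
    candidates: &[Candidate],
    target: u64,
    relayer: u64,
    target_close_group: &HashSet<u64>,
    relay_ip: IpAddr,
    wanted: usize,
) -> Vec<Candidate> {
    let mut seen_ips: HashSet<IpAddr> = HashSet::new();
    let mut picked = Vec::new();
    for c in candidates {
        if picked.len() == wanted {
            break;
        }
        let excluded = c.peer_id == target
            || c.peer_id == relayer
            || target_close_group.contains(&c.peer_id)
            || c.source_ip == relay_ip;
        if excluded {
            continue;
        }
        // `insert` returns false when this source IP was already used.
        if !seen_ips.insert(c.source_ip) {
            continue;
        }
        picked.push(c.clone());
    }
    picked
}
```

Deduplicating on source IP while iterating keeps the first shuffled peer per IP, so no single host can supply more than one witness vote.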

Follow-up

SemVer

  • SemVer: minor
  • Rationale: this adds a new relay verification behavior and changes when relay addresses become publishable, without introducing a public API break.
  • Follow-up refinement commit: SemVer: patch.

Validation

  • cargo fmt --all -- --check
  • git diff --check
  • cargo test reachability:: --all-features
  • cargo clippy --all-targets --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used
  • cargo test --all-features

Target

  • Base branch: rc-2026.6.2

Add a relay canary request/response protocol so newly acquired MASQUE relay addresses are cold-dialed by independent close-group witnesses before they enter DHT self-record gossip.

Keep legacy ADD_ADDRESS relay hints out of DHT records, retain canary-rejected relayers across ordinary acquisition failures, and return typed request timeouts so unreachable relay probes count as failed witness attempts.

SemVer: minor
mickvandijke and others added 4 commits June 18, 2026 14:04
Align the relay canary docs with the randomized non-close witness selection, version the internal canary request/response topic, and add driver-level tests for canary rejection retention across acquisition failures.

Add structured rollout fields for witness availability and ineligible responses, and link the transport cleanup follow-up for rejected MASQUE allocations.

SemVer: patch
…elayer knowledge

Witnesses previously refused to probe a relay address unless they already
held a Direct address for the named relayer (relay_canary_addr_matches_relayer_record).
Witnesses are chosen as non-close peers while the relayer is drawn from the
target's close group, so at scale a random witness almost never knows the
relayer: canaries returned RelayerUnknown, quorum fell to InsufficientWitnesses,
and relays were never published — leaving NAT'd nodes unreachable.

The relayer-knowledge check was only an anti-amplification rail, not part of
verification (the cold dial plus identity check needs just the relay address and
the target identity). Replace it with a per-source token-bucket rate limiter
(one dial per 10s per source, reusing crate::rate_limit::Engine, LRU-bounded),
which also subsumes the former per-source in-flight concurrency guard. Throttled
sources receive WitnessRateLimited (Ineligible), so they never count as a probe
failure and cannot trigger a false relay rejection.

SemVer: patch

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
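
The per-source limit described in this commit (one dial per 10s per source) behaves like a token bucket of capacity one. The sketch below is a standalone illustration under that assumption. It does not use `crate::rate_limit::Engine`, whose API is not shown here, and it omits the LRU bound. `PerSourceLimiter`, `Admission`, and the `u64` source key are hypothetical names; only `WitnessRateLimited` comes from the commit message.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// What the witness does with an incoming canary request.
#[derive(Debug, PartialEq)]
enum Admission {
    /// Go ahead and cold-dial the relay.
    Probe,
    /// Refuse without dialing; the requester treats this as ineligible.
    WitnessRateLimited,
}

/// Capacity-one token bucket per source: each source may trigger at most
/// one cold dial per `interval`. Unbounded map here; a real limiter would
/// need eviction (the commit mentions an LRU bound).
struct PerSourceLimiter {
    interval: Duration,
    last_dial: HashMap<u64, Instant>,
}

impl PerSourceLimiter {
    fn new(interval: Duration) -> Self {
        Self { interval, last_dial: HashMap::new() }
    }

    /// `now` is passed in so the behavior is deterministic in tests.
    fn admit(&mut self, source: u64, now: Instant) -> Admission {
        match self.last_dial.get(&source) {
            Some(&last) if now.saturating_duration_since(last) < self.interval => {
                // Throttled requests do not refresh the timestamp.
                Admission::WitnessRateLimited
            }
            _ => {
                self.last_dial.insert(source, now);
                Admission::Probe
            }
        }
    }
}
```

Because a throttled request maps to an ineligible response rather than a dial failure, a busy witness reduces coverage but cannot by itself cause a relay to be rejected.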
The canary exclusion set was preserved on RelayAcquisitionOutcome::Failed but
cleared on every other outcome. If the only close Direct candidate is an
excluded relayer, acquisition fails every round and that relayer is never
retried, leaving the node permanently relay-less.

Clear the set on AcquisitionFailed too, matching the InsufficientWitnesses
policy: exclusions now accumulate only across a contiguous run of Rejected
verdicts and reset on any other outcome. Backoff rate-limits retries and a
still-unreachable relay is simply re-excluded the next round.

SemVer: patch

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
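
The reset policy in this commit (exclusions accumulate only across a contiguous run of Rejected verdicts) can be sketched as a small state holder. The `RoundOutcome` enum and `ExclusionSet` type below are hypothetical stand-ins for illustration, not the crate's actual types.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of one acquisition round; variant names are
/// illustrative and may not match the crate's enums.
enum RoundOutcome {
    Verified,
    Rejected { relayer: u64 },
    InsufficientWitnesses,
    AcquisitionFailed,
}

/// Relayers that the canary quorum rejected in the current run.
#[derive(Default)]
struct ExclusionSet {
    relayers: HashSet<u64>,
}

impl ExclusionSet {
    /// A Rejected verdict adds the relayer; every other outcome clears the
    /// set, so a previously excluded relayer becomes eligible again.
    fn record(&mut self, outcome: &RoundOutcome) {
        match outcome {
            RoundOutcome::Rejected { relayer } => {
                self.relayers.insert(*relayer);
            }
            _ => self.relayers.clear(),
        }
    }

    fn is_excluded(&self, relayer: u64) -> bool {
        self.relayers.contains(&relayer)
    }
}
```

With this shape, a node whose only close Direct candidate was rejected retries it after the next failed acquisition round instead of staying relay-less.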
Do not count request-level relay canary errors as failed relay probes. In mixed-version networks an older witness can authenticate and ignore /rr/relay-canary-v1, producing a timeout without ever evaluating the relay. Keep explicit canary-capable DialFailed, IdentityExchangeFailed, and IdentityMismatch responses as eligible relay failures.

SemVer: patch
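
One way to picture the classification in this commit is a tally that counts only explicit canary responses. The sketch below is an assumption-heavy illustration: the `tally` function, the `quorum` parameter, the `Reachable` and `RequestError` variants, and the exact rule for choosing between Rejected and InsufficientWitnesses are hypothetical, since the PR does not state the thresholds. Only the failure names `DialFailed`, `IdentityExchangeFailed`, `IdentityMismatch`, and `WitnessRateLimited` come from the commit messages.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum WitnessResult {
    /// Witness dialed the relay and verified the target identity.
    Reachable,
    DialFailed,
    IdentityExchangeFailed,
    IdentityMismatch,
    /// Witness declined to probe; ineligible, not a relay failure.
    WitnessRateLimited,
    /// Timeout or other request-level error, e.g. an older witness that
    /// ignores the canary topic. The relay was never evaluated.
    RequestError,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Verified,
    Rejected,
    InsufficientWitnesses,
}

/// Assumed rule: `quorum` successes verify the relay; otherwise it is
/// rejected only if at least `quorum` witnesses actually evaluated it.
fn tally(results: &[WitnessResult], quorum: usize) -> Verdict {
    let mut ok = 0usize;
    let mut failed = 0usize;
    for r in results {
        match r {
            WitnessResult::Reachable => ok += 1,
            WitnessResult::DialFailed
            | WitnessResult::IdentityExchangeFailed
            | WitnessResult::IdentityMismatch => failed += 1,
            // Neither counts for nor against the relay.
            WitnessResult::WitnessRateLimited | WitnessResult::RequestError => {}
        }
    }
    if ok >= quorum {
        Verdict::Verified
    } else if ok + failed >= quorum {
        Verdict::Rejected
    } else {
        Verdict::InsufficientWitnesses
    }
}
```

Under this rule a round full of timeouts from older witnesses yields InsufficientWitnesses rather than a false rejection of a working relay.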
@mickvandijke mickvandijke force-pushed the feat/relay-canary-gate-rc-2026.6.2 branch from fe26eef to 5f4c745 Compare June 18, 2026 15:12