Skip to content

[PLACEHOLDER — needs review] Variable multi-minute iroh dial latency from edge to a freshly-started listen node #12

Description

@drewr

⚠ Placeholder — unverified. Opened by Claude from CLI-side observations during a debugging session. @drewr — please verify against edge-side metrics/logs before triaging. I have no direct visibility into iroh-gateway runtime telemetry; everything below is inferred from the agent-side perspective. Adjust scope / close as not-a-bug if the read is wrong. The companion app-side PR is datum-cloud/app#130 and the EnvoyPatchPolicy-reconcile placeholder (orthogonal, already narrowed) is datum-cloud/network-services-operator#175.

Symptom

After datum-connect tunnel listen reports all six controller conditions Ready at ~0.1s on a resume, the public proxy URL serves 5xx for an intermittent and surprisingly long window before the data plane carries traffic. The duration varies dramatically run-to-run on the same machine and same listen_key.

Two runs from the same session, ~15 minutes apart, no config or identity changes between them:

Run A — connect-debug-1.log (21:37:31 process start):

Setting up tunnel...
  ✓ tunnel accepted (0.1s) [HTTPProxy/tunnel-gchhg]
  ✓ TLS certificate issued (0.1s) [HTTPProxy/tunnel-gchhg]
  ✓ connector ready (0.1s) [Connector/datum-connect-mhxj5]
  ✓ iroh DNS published (0.1s) [Connector/datum-connect-mhxj5]
  ✓ route programmed (0.1s) [HTTPProxy/tunnel-gchhg]
  ✓ envoy metadata propagated (0.1s) [HTTPProxy/tunnel-gchhg]
Verifying connectivity...
  ✓ origin reachable (0.3s)
  ✓ proxy responding (0.3s) [https://career-replay-kyxnz.datumproxy.net]: HTTP 200
Tunnel ready after 0 sec

Iroh log on the agent at success: at 21:37:32.241, 530 ms after relay handshake completed, the first inbound:

INFO receive_relay{src=202949a59b}: iroh::magicsock::endpoint_map:
     inserting new endpoint in EndpointMap endpoint=202949a59b
     relay_url=Some(RelayUrl("https://iroh-relay.us-east-1.datumconnect.net./"))
     source=relay

Run B — connect-debug-2.log (21:53:36 process start, same machine, same listen_key, same connector):

Setting up tunnel...
  ✓ <all six conditions> (0.1s)
Verifying connectivity...
  ✓ origin reachable (0.3s)
  … proxy not responding (0s)   HTTP 503
  … proxy not responding (10s)  HTTP 503
  … proxy not responding (20s)  HTTP 503
  … proxy not responding (30s)  HTTP 503
  … proxy not responding (40s)  HTTP 503
  … proxy not responding (50s)  HTTP 503
  … proxy not responding (60s)  HTTP 503
  … proxy not responding (70s)  HTTP 503
  … proxy not responding (80s)  HTTP 503
  … proxy not responding (90s)  HTTP 500
  … proxy not responding (100s) HTTP 500
  … proxy not responding (110s) HTTP 500
  … proxy not responding (120s) HTTP 500
  … proxy not responding (130s) HTTP 500
  … proxy not responding (140s) HTTP 500
  … proxy not responding (150s) HTTP 500
  … proxy not responding (160s) HTTP 500
  ✓ proxy responding (170.4s) HTTP 200
Tunnel ready after 170 sec

Agent-side iroh log right before success: at 21:56:22.866, after 166 s of silence:

WARN  iroh::magicsock: UDP recv quic packets: no endpoint state found,
      skipping src=104.225.115.21:40878 count=1 len=1200

INFO  receive_relay{src=f9e253c340}: iroh::magicsock::endpoint_map:
      inserting new endpoint in EndpointMap endpoint=f9e253c340
      source=relay

(A direct-UDP attempt landed 23 ms before the relay packet — so the network path was reachable; the edge just hadn't tried until then.)

What this rules out

We've already eliminated several plausible-sounding explanations:

  • Not the control plane. All six HTTPProxy + Connector conditions flip True with observedGeneration matching metadata.generation at ~0.1s. The CLI-side progress streaming (datum-cloud/app@c39d9ee) and the observedGeneration-aware Ready check (@d65ec4d) both confirm the operator side is fully reconciled before the 5xx window even starts.
  • Not a client-side spec patch causing Envoy re-reconcile. Both runs print Tunnel already configured correctly.update_active was not called and HTTPProxy.spec was not touched. The CLI-side idempotency landing in datum-cloud/app@0311960 makes this guarantee explicit at the lib boundary going forward.
  • Not warm-vs-cold session state. Same code path, same listen_key, 15 minutes apart, fully opposite outcomes (530ms vs 166s). Whatever "warm cache" model would predict here doesn't fit.
  • Not Envoy outlier-detection in a simple sense. The 5xx ordering inverts between runs: Run B is 503 → 500 → 200, while a prior 140-second run was 500 → 503 → 200. A clean "eject after consecutive 5xx" story would predict one stable ordering.

What this points at

The latency is entirely between "iroh-gateway has the Envoy cluster config naming our iroh endpoint" and "iroh-gateway has actually opened a stream to our iroh peer." The agent-side receive_relay event is the success marker on every observation. iroh-gateway is the only component in that gap.

Plausible mechanisms, in rough order of likelihood (each needs edge-side verification):

  1. Lazy dial timing / discovery delay. If iroh-gateway only resolves the target endpoint and opens a stream on first inbound request (or on a periodic refresh schedule), the wait until that first dial attempt determines the user-visible wait. Variability would come from whatever schedules / debounces the dial.
  2. Stale iroh-discovery cache. If iroh-gateway caches resolved iroh endpoint → relay/addresses with a TTL, a request that lands during a stale window goes nowhere useful until the cache refreshes. The 166s number is suspiciously close to common 2–3 minute caching defaults.
  3. Per-instance state asymmetry. The two runs see different edge peer IDs (202949a59b and f9e253c340 in run A, only f9e253c340 initially in run B). If iroh-gateway is horizontally scaled and only some instances have us cached, latency depends on which instance handles the request.
  4. Outlier-detection / circuit-breaker state held across requests with non-deterministic re-evaluation. Could explain both the 5xx-code-flipping during the window and the variable total duration. Default Envoy base_ejection_time is 30s; if compounded across consecutive ejections, multi-minute windows would appear.

What would confirm

Things that probably aren't visible from iroh-gateway's current logs but would be diagnostic:

  • For each request that returned 5xx during a wait window: did iroh-gateway attempt a dial to our endpoint? Did the dial fail synchronously, or did it succeed but the underlying stream never carried bytes?
  • iroh-gateway-side metrics for time-from-first-request-to-first-successful-stream per (client_request_id, target_endpoint_id).
  • Whether iroh-gateway is sharing a single iroh node across all upstream targets (one cache, one relay connection, etc.) or per-target.
  • Any retry / outlier behavior in src/gateway.rs or src/endpoint.rs that gates when the upstream is re-tried after a failure.

Concrete repro context

  • Agent listen_key (constant across both runs) → iroh endpoint id: c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8
  • Connector: datum-connect-mhxj5 in project drewr-y4nd1b
  • HTTPProxy: tunnel-gchhg
  • Public hostname: career-replay-kyxnz.datumproxy.net
  • Home relay: https://iroh-relay.us-east-1.datumconnect.net./
  • Run A: 2026-06-07T21:37:31Z → ready at 21:37:32 (~530ms after relay handshake)
  • Run B: 2026-06-07T21:53:36Z → ready at 21:56:27 (170s after process start)

Full logs are in the app-side investigator's /tmp/connect-debug-1.log and /tmp/connect-debug-2.log@drewr can request them or recreate the conditions trivially with datum-connect tunnel listen.

Why this issue lives here

  • Not in datum-cloud/network-services-operator because the operator is done programming Envoy at t≈0.1s. The companion issue datum-cloud/network-services-operator#175 is narrowed to a separate fresh-tunnel reconcile-latency case and shouldn't be conflated.
  • Not in n0-computer/iroh-proxy-utils because that's a generic primitives crate. If investigation here finds the root cause is in one of those primitives (e.g. a default timeout, missing cache invalidation hook), escalating upstream from this issue's discussion is the appropriate path — but the workload-specific symptom report belongs against the deployed service that's exhibiting it.

Action for @drewr

Verify the symptom against iroh-gateway logs/metrics, confirm or reject the hypotheses, and either close-as-not-a-bug if my read is wrong or retitle/scope this once root cause is understood. If the fix turns out to live in iroh-proxy-utils, link a follow-up issue there from this one.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions