[PLACEHOLDER — needs review] Variable multi-minute iroh dial latency from edge to a freshly-started listen node

> **⚠ Placeholder — unverified.** Opened by Claude from CLI-side observations during a debugging session. **@drewr — please verify against edge-side metrics/logs before triaging.** I have no direct visibility into iroh-gateway runtime telemetry; everything below is inferred from the agent-side perspective. Adjust scope / close as not-a-bug if the read is wrong. The companion app-side PR is [datum-cloud/app#130](https://github.com/datum-cloud/app/pull/130) and the EnvoyPatchPolicy-reconcile placeholder (orthogonal, already narrowed) is [datum-cloud/network-services-operator#175](https://github.com/datum-cloud/network-services-operator/issues/175).

## Symptom

After `datum-connect tunnel listen` reports all six controller conditions Ready at ~0.1s on a resume, the public proxy URL serves 5xx for an intermittent and surprisingly long window before the data plane carries traffic. The duration varies dramatically run-to-run on the same machine and same listen_key.

Two runs from the same session, ~15 minutes apart, no config or identity changes between them:

**Run A — `connect-debug-1.log` (21:37:31 process start):**

```
Setting up tunnel...
  ✓ tunnel accepted (0.1s) [HTTPProxy/tunnel-gchhg]
  ✓ TLS certificate issued (0.1s) [HTTPProxy/tunnel-gchhg]
  ✓ connector ready (0.1s) [Connector/datum-connect-mhxj5]
  ✓ iroh DNS published (0.1s) [Connector/datum-connect-mhxj5]
  ✓ route programmed (0.1s) [HTTPProxy/tunnel-gchhg]
  ✓ envoy metadata propagated (0.1s) [HTTPProxy/tunnel-gchhg]
Verifying connectivity...
  ✓ origin reachable (0.3s)
  ✓ proxy responding (0.3s) [https://career-replay-kyxnz.datumproxy.net]: HTTP 200
Tunnel ready after 0 sec
```

Iroh log on the agent at success: at 21:37:32.241, **530 ms** after relay handshake completed, the first inbound:
```
INFO receive_relay{src=202949a59b}: iroh::magicsock::endpoint_map:
     inserting new endpoint in EndpointMap endpoint=202949a59b
     relay_url=Some(RelayUrl("https://iroh-relay.us-east-1.datumconnect.net./"))
     source=relay
```

**Run B — `connect-debug-2.log` (21:53:36 process start, same machine, same listen_key, same connector):**

```
Setting up tunnel...
  ✓ <all six conditions> (0.1s)
Verifying connectivity...
  ✓ origin reachable (0.3s)
  … proxy not responding (0s)   HTTP 503
  … proxy not responding (10s)  HTTP 503
  … proxy not responding (20s)  HTTP 503
  … proxy not responding (30s)  HTTP 503
  … proxy not responding (40s)  HTTP 503
  … proxy not responding (50s)  HTTP 503
  … proxy not responding (60s)  HTTP 503
  … proxy not responding (70s)  HTTP 503
  … proxy not responding (80s)  HTTP 503
  … proxy not responding (90s)  HTTP 500
  … proxy not responding (100s) HTTP 500
  … proxy not responding (110s) HTTP 500
  … proxy not responding (120s) HTTP 500
  … proxy not responding (130s) HTTP 500
  … proxy not responding (140s) HTTP 500
  … proxy not responding (150s) HTTP 500
  … proxy not responding (160s) HTTP 500
  ✓ proxy responding (170.4s) HTTP 200
Tunnel ready after 170 sec
```

Agent-side iroh log right before success: at 21:56:22.866, after **166 s** of silence:
```
WARN  iroh::magicsock: UDP recv quic packets: no endpoint state found,
      skipping src=104.225.115.21:40878 count=1 len=1200

INFO  receive_relay{src=f9e253c340}: iroh::magicsock::endpoint_map:
      inserting new endpoint in EndpointMap endpoint=f9e253c340
      source=relay
```

(A direct-UDP attempt landed 23 ms before the relay packet — so the network path was reachable; the edge just hadn't tried until then.)

## What this rules out

We've already eliminated several plausible-sounding explanations:

- **Not the control plane.** All six HTTPProxy + Connector conditions flip True with `observedGeneration` matching `metadata.generation` at ~0.1s. The CLI-side progress streaming ([datum-cloud/app@c39d9ee](https://github.com/datum-cloud/app/commit/c39d9ee)) and the observedGeneration-aware Ready check ([@d65ec4d](https://github.com/datum-cloud/app/commit/d65ec4d)) both confirm the operator side is fully reconciled before the 5xx window even starts.
- **Not a client-side spec patch causing Envoy re-reconcile.** Both runs print `Tunnel already configured correctly.` — `update_active` was not called and `HTTPProxy.spec` was not touched. The CLI-side idempotency landing in [datum-cloud/app@0311960](https://github.com/datum-cloud/app/commit/0311960) makes this guarantee explicit at the lib boundary going forward.
- **Not warm-vs-cold session state.** Same code path, same listen_key, 15 minutes apart, fully opposite outcomes (530ms vs 166s). Whatever "warm cache" model would predict here doesn't fit.
- **Not Envoy outlier-detection in a simple sense.** The 5xx ordering inverts between runs: Run B is `503 → 500 → 200`, while a prior 140-second run was `500 → 503 → 200`. A clean "eject after consecutive 5xx" story would predict one stable ordering.

## What this points at

The latency is entirely between **"iroh-gateway has the Envoy cluster config naming our iroh endpoint"** and **"iroh-gateway has actually opened a stream to our iroh peer."** The agent-side `receive_relay` event is the success marker on every observation. iroh-gateway is the only component in that gap.

Plausible mechanisms, in rough order of likelihood (each needs edge-side verification):

1. **Lazy dial timing / discovery delay.** If iroh-gateway only resolves the target endpoint and opens a stream on first inbound request (or on a periodic refresh schedule), the wait until that first dial attempt determines the user-visible wait. Variability would come from whatever schedules / debounces the dial.
2. **Stale iroh-discovery cache.** If iroh-gateway caches resolved iroh endpoint → relay/addresses with a TTL, a request that lands during a stale window goes nowhere useful until the cache refreshes. The 166s number is suspiciously close to common 2–3 minute caching defaults.
3. **Per-instance state asymmetry.** The two runs see *different* edge peer IDs (`202949a59b` and `f9e253c340` in run A, only `f9e253c340` initially in run B). If iroh-gateway is horizontally scaled and only some instances have us cached, latency depends on which instance handles the request.
4. **Outlier-detection / circuit-breaker state held across requests with non-deterministic re-evaluation.** Could explain both the 5xx-code-flipping during the window and the variable total duration. Default Envoy `base_ejection_time` is 30s; if compounded across consecutive ejections, multi-minute windows would appear.

## What would confirm

Things that probably aren't visible from iroh-gateway's current logs but would be diagnostic:

- For each request that returned 5xx during a wait window: did iroh-gateway *attempt* a dial to our endpoint? Did the dial fail synchronously, or did it succeed but the underlying stream never carried bytes?
- iroh-gateway-side metrics for time-from-first-request-to-first-successful-stream per `(client_request_id, target_endpoint_id)`.
- Whether iroh-gateway is sharing a single iroh node across all upstream targets (one cache, one relay connection, etc.) or per-target.
- Any retry / outlier behavior in `src/gateway.rs` or `src/endpoint.rs` that gates when the upstream is re-tried after a failure.

## Concrete repro context

- Agent listen_key (constant across both runs) → iroh endpoint id: `c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8`
- Connector: `datum-connect-mhxj5` in project `drewr-y4nd1b`
- HTTPProxy: `tunnel-gchhg`
- Public hostname: `career-replay-kyxnz.datumproxy.net`
- Home relay: `https://iroh-relay.us-east-1.datumconnect.net./`
- Run A: 2026-06-07T21:37:31Z → ready at 21:37:32 (~530ms after relay handshake)
- Run B: 2026-06-07T21:53:36Z → ready at 21:56:27 (170s after process start)

Full logs are in the app-side investigator's `/tmp/connect-debug-1.log` and `/tmp/connect-debug-2.log` — @drewr can request them or recreate the conditions trivially with `datum-connect tunnel listen`.

## Why this issue lives here

- **Not in `datum-cloud/network-services-operator`** because the operator is done programming Envoy at t≈0.1s. The companion issue [datum-cloud/network-services-operator#175](https://github.com/datum-cloud/network-services-operator/issues/175) is narrowed to a separate fresh-tunnel reconcile-latency case and shouldn't be conflated.
- **Not in `n0-computer/iroh-proxy-utils`** because that's a generic primitives crate. If investigation here finds the root cause is in one of those primitives (e.g. a default timeout, missing cache invalidation hook), escalating upstream from this issue's discussion is the appropriate path — but the workload-specific symptom report belongs against the deployed service that's exhibiting it.

## Action for @drewr

Verify the symptom against iroh-gateway logs/metrics, confirm or reject the hypotheses, and either close-as-not-a-bug if my read is wrong or retitle/scope this once root cause is understood. If the fix turns out to live in `iroh-proxy-utils`, link a follow-up issue there from this one.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[PLACEHOLDER — needs review] Variable multi-minute iroh dial latency from edge to a freshly-started listen node #12

Symptom

What this rules out

What this points at

What would confirm

Concrete repro context

Why this issue lives here

Action for @drewr

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[PLACEHOLDER — needs review] Variable multi-minute iroh dial latency from edge to a freshly-started listen node #12

Description

Symptom

What this rules out

What this points at

What would confirm

Concrete repro context

Why this issue lives here

Action for @drewr

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions