⚠ Placeholder — unverified. Opened by Claude from CLI-side observations during a debugging session. @drewr — please verify against edge-side metrics/logs before triaging. I have no direct visibility into iroh-gateway runtime telemetry; everything below is inferred from the agent-side perspective. Adjust scope / close as not-a-bug if the read is wrong. The companion app-side PR is datum-cloud/app#130 and the EnvoyPatchPolicy-reconcile placeholder (orthogonal, already narrowed) is datum-cloud/network-services-operator#175.
Symptom
After datum-connect tunnel listen reports all six controller conditions Ready at ~0.1s on a resume, the public proxy URL serves 5xx for an intermittent and surprisingly long window before the data plane carries traffic. The duration varies dramatically run-to-run on the same machine and same listen_key.
Two runs from the same session, ~15 minutes apart, no config or identity changes between them:
Run A — connect-debug-1.log (21:37:31 process start):
Setting up tunnel...
✓ tunnel accepted (0.1s) [HTTPProxy/tunnel-gchhg]
✓ TLS certificate issued (0.1s) [HTTPProxy/tunnel-gchhg]
✓ connector ready (0.1s) [Connector/datum-connect-mhxj5]
✓ iroh DNS published (0.1s) [Connector/datum-connect-mhxj5]
✓ route programmed (0.1s) [HTTPProxy/tunnel-gchhg]
✓ envoy metadata propagated (0.1s) [HTTPProxy/tunnel-gchhg]
Verifying connectivity...
✓ origin reachable (0.3s)
✓ proxy responding (0.3s) [https://career-replay-kyxnz.datumproxy.net]: HTTP 200
Tunnel ready after 0 sec
Iroh log on the agent at success: at 21:37:32.241, 530 ms after relay handshake completed, the first inbound:
INFO receive_relay{src=202949a59b}: iroh::magicsock::endpoint_map:
inserting new endpoint in EndpointMap endpoint=202949a59b
relay_url=Some(RelayUrl("https://iroh-relay.us-east-1.datumconnect.net./"))
source=relay
Run B — connect-debug-2.log (21:53:36 process start, same machine, same listen_key, same connector):
Setting up tunnel...
✓ <all six conditions> (0.1s)
Verifying connectivity...
✓ origin reachable (0.3s)
… proxy not responding (0s) HTTP 503
… proxy not responding (10s) HTTP 503
… proxy not responding (20s) HTTP 503
… proxy not responding (30s) HTTP 503
… proxy not responding (40s) HTTP 503
… proxy not responding (50s) HTTP 503
… proxy not responding (60s) HTTP 503
… proxy not responding (70s) HTTP 503
… proxy not responding (80s) HTTP 503
… proxy not responding (90s) HTTP 500
… proxy not responding (100s) HTTP 500
… proxy not responding (110s) HTTP 500
… proxy not responding (120s) HTTP 500
… proxy not responding (130s) HTTP 500
… proxy not responding (140s) HTTP 500
… proxy not responding (150s) HTTP 500
… proxy not responding (160s) HTTP 500
✓ proxy responding (170.4s) HTTP 200
Tunnel ready after 170 sec
Agent-side iroh log right before success: at 21:56:22.866, after 166 s of silence:
WARN iroh::magicsock: UDP recv quic packets: no endpoint state found,
skipping src=104.225.115.21:40878 count=1 len=1200
INFO receive_relay{src=f9e253c340}: iroh::magicsock::endpoint_map:
inserting new endpoint in EndpointMap endpoint=f9e253c340
source=relay
(A direct-UDP attempt landed 23 ms before the relay packet — so the network path was reachable; the edge just hadn't tried until then.)
What this rules out
We've already eliminated several plausible-sounding explanations:
- Not the control plane. All six HTTPProxy + Connector conditions flip True with
observedGeneration matching metadata.generation at ~0.1s. The CLI-side progress streaming (datum-cloud/app@c39d9ee) and the observedGeneration-aware Ready check (@d65ec4d) both confirm the operator side is fully reconciled before the 5xx window even starts.
- Not a client-side spec patch causing Envoy re-reconcile. Both runs print
Tunnel already configured correctly. — update_active was not called and HTTPProxy.spec was not touched. The CLI-side idempotency landing in datum-cloud/app@0311960 makes this guarantee explicit at the lib boundary going forward.
- Not warm-vs-cold session state. Same code path, same listen_key, 15 minutes apart, fully opposite outcomes (530ms vs 166s). Whatever "warm cache" model would predict here doesn't fit.
- Not Envoy outlier-detection in a simple sense. The 5xx ordering inverts between runs: Run B is
503 → 500 → 200, while a prior 140-second run was 500 → 503 → 200. A clean "eject after consecutive 5xx" story would predict one stable ordering.
What this points at
The latency is entirely between "iroh-gateway has the Envoy cluster config naming our iroh endpoint" and "iroh-gateway has actually opened a stream to our iroh peer." The agent-side receive_relay event is the success marker on every observation. iroh-gateway is the only component in that gap.
Plausible mechanisms, in rough order of likelihood (each needs edge-side verification):
- Lazy dial timing / discovery delay. If iroh-gateway only resolves the target endpoint and opens a stream on first inbound request (or on a periodic refresh schedule), the wait until that first dial attempt determines the user-visible wait. Variability would come from whatever schedules / debounces the dial.
- Stale iroh-discovery cache. If iroh-gateway caches resolved iroh endpoint → relay/addresses with a TTL, a request that lands during a stale window goes nowhere useful until the cache refreshes. The 166s number is suspiciously close to common 2–3 minute caching defaults.
- Per-instance state asymmetry. The two runs see different edge peer IDs (
202949a59b and f9e253c340 in run A, only f9e253c340 initially in run B). If iroh-gateway is horizontally scaled and only some instances have us cached, latency depends on which instance handles the request.
- Outlier-detection / circuit-breaker state held across requests with non-deterministic re-evaluation. Could explain both the 5xx-code-flipping during the window and the variable total duration. Default Envoy
base_ejection_time is 30s; if compounded across consecutive ejections, multi-minute windows would appear.
What would confirm
Things that probably aren't visible from iroh-gateway's current logs but would be diagnostic:
- For each request that returned 5xx during a wait window: did iroh-gateway attempt a dial to our endpoint? Did the dial fail synchronously, or did it succeed but the underlying stream never carried bytes?
- iroh-gateway-side metrics for time-from-first-request-to-first-successful-stream per
(client_request_id, target_endpoint_id).
- Whether iroh-gateway is sharing a single iroh node across all upstream targets (one cache, one relay connection, etc.) or per-target.
- Any retry / outlier behavior in
src/gateway.rs or src/endpoint.rs that gates when the upstream is re-tried after a failure.
Concrete repro context
- Agent listen_key (constant across both runs) → iroh endpoint id:
c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8
- Connector:
datum-connect-mhxj5 in project drewr-y4nd1b
- HTTPProxy:
tunnel-gchhg
- Public hostname:
career-replay-kyxnz.datumproxy.net
- Home relay:
https://iroh-relay.us-east-1.datumconnect.net./
- Run A: 2026-06-07T21:37:31Z → ready at 21:37:32 (~530ms after relay handshake)
- Run B: 2026-06-07T21:53:36Z → ready at 21:56:27 (170s after process start)
Full logs are in the app-side investigator's /tmp/connect-debug-1.log and /tmp/connect-debug-2.log — @drewr can request them or recreate the conditions trivially with datum-connect tunnel listen.
Why this issue lives here
- Not in
datum-cloud/network-services-operator because the operator is done programming Envoy at t≈0.1s. The companion issue datum-cloud/network-services-operator#175 is narrowed to a separate fresh-tunnel reconcile-latency case and shouldn't be conflated.
- Not in
n0-computer/iroh-proxy-utils because that's a generic primitives crate. If investigation here finds the root cause is in one of those primitives (e.g. a default timeout, missing cache invalidation hook), escalating upstream from this issue's discussion is the appropriate path — but the workload-specific symptom report belongs against the deployed service that's exhibiting it.
Verify the symptom against iroh-gateway logs/metrics, confirm or reject the hypotheses, and either close-as-not-a-bug if my read is wrong or retitle/scope this once root cause is understood. If the fix turns out to live in iroh-proxy-utils, link a follow-up issue there from this one.
Symptom
After
datum-connect tunnel listenreports all six controller conditions Ready at ~0.1s on a resume, the public proxy URL serves 5xx for an intermittent and surprisingly long window before the data plane carries traffic. The duration varies dramatically run-to-run on the same machine and same listen_key.Two runs from the same session, ~15 minutes apart, no config or identity changes between them:
Run A —
connect-debug-1.log(21:37:31 process start):Iroh log on the agent at success: at 21:37:32.241, 530 ms after relay handshake completed, the first inbound:
Run B —
connect-debug-2.log(21:53:36 process start, same machine, same listen_key, same connector):Agent-side iroh log right before success: at 21:56:22.866, after 166 s of silence:
(A direct-UDP attempt landed 23 ms before the relay packet — so the network path was reachable; the edge just hadn't tried until then.)
What this rules out
We've already eliminated several plausible-sounding explanations:
observedGenerationmatchingmetadata.generationat ~0.1s. The CLI-side progress streaming (datum-cloud/app@c39d9ee) and the observedGeneration-aware Ready check (@d65ec4d) both confirm the operator side is fully reconciled before the 5xx window even starts.Tunnel already configured correctly.—update_activewas not called andHTTPProxy.specwas not touched. The CLI-side idempotency landing in datum-cloud/app@0311960 makes this guarantee explicit at the lib boundary going forward.503 → 500 → 200, while a prior 140-second run was500 → 503 → 200. A clean "eject after consecutive 5xx" story would predict one stable ordering.What this points at
The latency is entirely between "iroh-gateway has the Envoy cluster config naming our iroh endpoint" and "iroh-gateway has actually opened a stream to our iroh peer." The agent-side
receive_relayevent is the success marker on every observation. iroh-gateway is the only component in that gap.Plausible mechanisms, in rough order of likelihood (each needs edge-side verification):
202949a59bandf9e253c340in run A, onlyf9e253c340initially in run B). If iroh-gateway is horizontally scaled and only some instances have us cached, latency depends on which instance handles the request.base_ejection_timeis 30s; if compounded across consecutive ejections, multi-minute windows would appear.What would confirm
Things that probably aren't visible from iroh-gateway's current logs but would be diagnostic:
(client_request_id, target_endpoint_id).src/gateway.rsorsrc/endpoint.rsthat gates when the upstream is re-tried after a failure.Concrete repro context
c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8datum-connect-mhxj5in projectdrewr-y4nd1btunnel-gchhgcareer-replay-kyxnz.datumproxy.nethttps://iroh-relay.us-east-1.datumconnect.net./Full logs are in the app-side investigator's
/tmp/connect-debug-1.logand/tmp/connect-debug-2.log— @drewr can request them or recreate the conditions trivially withdatum-connect tunnel listen.Why this issue lives here
datum-cloud/network-services-operatorbecause the operator is done programming Envoy at t≈0.1s. The companion issue datum-cloud/network-services-operator#175 is narrowed to a separate fresh-tunnel reconcile-latency case and shouldn't be conflated.n0-computer/iroh-proxy-utilsbecause that's a generic primitives crate. If investigation here finds the root cause is in one of those primitives (e.g. a default timeout, missing cache invalidation hook), escalating upstream from this issue's discussion is the appropriate path — but the workload-specific symptom report belongs against the deployed service that's exhibiting it.Action for @drewr
Verify the symptom against iroh-gateway logs/metrics, confirm or reject the hypotheses, and either close-as-not-a-bug if my read is wrong or retitle/scope this once root cause is understood. If the fix turns out to live in
iroh-proxy-utils, link a follow-up issue there from this one.