Relay UI lag: dev-mode Next.js + redundant polling + recompile churn over the frp tunnel #323

## What happened

The web UI lags badly over the gini-relay tunnel, and Next.js constantly shows "rendering" / "compiling". Tracing the `default` instance (the one the relay is attached to) found four compounding causes, all amplified by the relay round-trip.

The relay is an frp reverse tunnel: browser → Caddy TLS at `gini-relay.lilaclabs.ai` → frps → frpc on the host → gateway → Next.js child (evidence: `node_modules/gini-relay/frps.toml:6`, `Caddyfile`). Measured edge round-trip is **98.3 ms TTFB** (DNS 18.2, connect 41.8, TLS 73.8) versus **7–14 ms** for the same API served on `localhost` (web.log per-request timings). So every request that is instant locally costs at least 98.3 ms over the relay, and the UI fires hundreds of them.

**Root cause #1 — the relay serves Next.js in DEV mode.** `src/cli/process.ts:360` spawns the web child as `["run", "dev", "--", "-H", "127.0.0.1", "-p", String(port)]` (`next dev`). Consequences: compile-on-demand (web.log shows compiles from 32 ms up to **3.4 s** at `✓ Compiled in 3.4s`, web.log:184074); a **179 MB** unminified dev build dir (`web/.next-<instance>`) served as uncached per-module chunks the browser refetches across the tunnel; plus an HMR websocket held open through frp.

**Root cause #2 — redundant fast polling, each tick a relay round-trip.** `web/src/components/RuntimeStreamBridge.tsx:72-73` already invalidates react-query keys on every SSE event, and the code itself says per-query `refetchInterval` is therefore "only … a slow safety net." But several high-traffic queries still poll every 3000 ms (one at 800 ms). Request counts from one web.log:

| Endpoint | Requests | Driver |
|---|---|---|
| `/api/runtime/chat` | 43124 | `useChatSessions` 3000 ms (queries.ts:178) + agent-chat 3000 ms (queries.ts:620) |
| `/api/runtime/jobs` | 22837 | `useAllJobs` 3000 ms (queries.ts:167) |
| `/api/runtime/threads` | 21844 | threads 3000 ms (queries.ts:655) + threads-inbox 3000 ms (queries.ts:745) |
| `/api/runtime/__healthz` | 17060 | UpdateGate / health probe |
| `/api/runtime/chat/<id>/threads` | 8952 | per-session threads 3000 ms |
| `/api/runtime/agents/<id>/chat` | 8947 | active agent chat 3000 ms |

Per-task polling adds thousands of `GET /api/runtime/tasks/task_<id>` rows (each task fetched 5–12×), the active-chat poll drops to **800 ms** while a turn is in flight (queries.ts:336), and browser status polls at **1000 ms** (queries.ts:825). On the chat page the 3000 ms pollers alone fire 5 requests / 3 s = 100 requests/min, each at least 98.3 ms over the relay, whether or not anything changed. `providers.tsx:57` also sets `refetchOnWindowFocus: true`, so every tab refocus refetches every mounted query at once.
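The rate arithmetic above can be checked with a small model. This is only a sketch: it assumes the five 3000 ms pollers on the chat page are the ones cited from queries.ts in this issue, and `Poller`, `requestsPerMinute`, and `withInterval` are hypothetical helper names, not code from the repo.

```typescript
// Hypothetical model of independent fixed-interval pollers.
interface Poller {
  name: string;
  intervalMs: number;
}

// Steady-state request rate: each poller fires 60000 / intervalMs times per minute.
function requestsPerMinute(pollers: Poller[]): number {
  return pollers.reduce((sum, p) => sum + 60_000 / p.intervalMs, 0);
}

// Same pollers, all moved to a new interval (e.g. a safety-net cadence).
function withInterval(pollers: Poller[], intervalMs: number): Poller[] {
  return pollers.map((p) => ({ ...p, intervalMs }));
}

// Assumption: these are the five 3000 ms pollers mounted on the chat page.
const chatPagePollers: Poller[] = [
  { name: "useChatSessions", intervalMs: 3000 },
  { name: "agent-chat", intervalMs: 3000 },
  { name: "useAllJobs", intervalMs: 3000 },
  { name: "threads", intervalMs: 3000 },
  { name: "threads-inbox", intervalMs: 3000 },
];

console.log(requestsPerMinute(chatPagePollers)); // 100
console.log(requestsPerMinute(withInterval(chatPagePollers, 30_000))); // 10
console.log(requestsPerMinute(withInterval(chatPagePollers, 60_000))); // 5
```

The 800 ms and 1000 ms pollers are left out here, so the real rate while a turn is in flight would be higher than this baseline.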

**Root cause #3 — frequent web-dev restarts, each a cold recompile.** web.log contains **19 `✓ Ready in` lines** (dev-server restarts) and 6 `✓ Compiled` lines. Each restart discards the compiled-route cache, so the next navigation recompiles. The 3.4 s compile (web.log:184074) sits within a few hundred log lines of a restart cluster (`✓ Ready in 830ms` / `759ms` at web.log:184287 / 184567). Likely triggers are autostart reconcile, autostart refresh, or self-update (`src/runtime/autostart-reconcile.ts`, `autostart-refresh.ts`); issue #260 already traced one "mystery SIGTERM" to autostart reconcile.

**Root cause #4 — SSE reconnect churn over the tunnel keeps the pollers primary.** The runtime stream is long-lived SSE (`useRuntimeStream.ts` → `/api/runtime/events/stream`; per-session `/api/runtime/chat/<id>/stream`). It needed `resilient-event-source.ts` because "a gateway restart turns the BFF's stream route into a 503, which permanently CLOSES a bare EventSource." web.log shows `/api/runtime/events/stream` reopened 84× and the main session's stream 76×. Long-lived SSE is the most fragile connection over frp, and every drop leans on the #2 polling as the primary freshness path — which is why those safety nets fire tens of thousands of times.

## What you expected

Over the relay the UI should feel responsive: assets cached and minified, no recompile stalls, and request volume bounded by actual activity (driven by SSE) rather than a constant 100+ requests/min of redundant polling each paying the tunnel round-trip.

## Reproduction

1. Attach an instance to the gini-relay tunnel and open the chat UI through the `*.gini-relay.lilaclabs.ai` URL.
2. Navigate between routes / agents and send a chat turn; observe "compiling" stalls and lag on each interaction.
3. Inspect `~/.gini/instances/<instance>/logs/web.log`: count API polls (`rg -o "GET /api/runtime/[^ ?]+" web.log | sort | uniq -c | sort -rn`) and compile/restart lines (`rg -c "Ready in"`, `rg -n "Compiled in"`).
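For anyone who prefers a script to the `rg` one-liners in step 3, the same counts can be sketched in TypeScript. The function names are hypothetical and the sample log lines are made up for illustration; the real web.log line format may differ.

```typescript
// Count "GET /api/runtime/<path>" occurrences, ignoring query strings,
// and return [endpoint, count] pairs sorted by count, highest first.
function tallyEndpoints(logText: string): Array<[string, number]> {
  const counts: Record<string, number> = {};
  const matches = logText.match(/GET \/api\/runtime\/[^\s?]+/g) || [];
  for (const m of matches) {
    counts[m] = (counts[m] || 0) + 1;
  }
  return Object.keys(counts)
    .map((k): [string, number] => [k, counts[k]])
    .sort((a, b) => b[1] - a[1]);
}

// Count lines containing a marker such as "Ready in" (like `rg -c`).
function countLinesContaining(logText: string, needle: string): number {
  return logText.split("\n").filter((line) => line.indexOf(needle) !== -1).length;
}

// Illustrative sample only; not copied from a real web.log.
const sampleLog = [
  " GET /api/runtime/chat 200 in 9ms",
  " GET /api/runtime/jobs 200 in 7ms",
  " GET /api/runtime/chat?limit=20 200 in 11ms",
  " ✓ Ready in 830ms",
  " ✓ Compiled in 3.4s",
].join("\n");

console.log(tallyEndpoints(sampleLog));
console.log(countLinesContaining(sampleLog, "Ready in"));
```

To run it against a real log, read `~/.gini/instances/<instance>/logs/web.log` into a string and pass it to both functions.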

## Environment

- OS: macOS 26.3 (build 25D125)
- Bun version (`bun --version`): 1.3.14
- Gini install method: git clone
- Instance name: default
- Provider (codex / openai / openrouter / local / echo): codex (gpt-5.5)

## Proposed fixes

**Quick wins (config / intervals, low-risk, reversible):**
1. Trust SSE — raise the 3000 ms `refetchInterval`s (queries.ts:167, 178, 620, 655, 745) to a true safety-net cadence (30000–60000 ms; 60000/3000 = 20× traffic cut, 30000/3000 = 10×), and gate the 800 ms active poll (queries.ts:336) behind "SSE unhealthy."
2. Disable `refetchOnWindowFocus` (providers.tsx:57) or scope it to the few queries that need it.
3. Slow the `__healthz` poll (17060 hits) to a heartbeat cadence.
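One possible shape for quick win 1 is a single helper that picks the poll interval from SSE health, so fast polling only happens while the stream is down. This is a sketch under assumptions: `StreamHealth`, `PollContext`, and `pollIntervalMs` are hypothetical names, and how the app actually tracks stream state is not shown in this issue.

```typescript
// Hypothetical view of the runtime stream's connection state.
type StreamHealth = "open" | "connecting" | "closed";

interface PollContext {
  stream: StreamHealth;
  turnInFlight: boolean; // true while a chat turn is being processed
}

const SAFETY_NET_MS = 30_000; // slow cadence while SSE drives invalidation
const FALLBACK_MS = 3_000; // current cadence, kept only for SSE outages
const ACTIVE_FALLBACK_MS = 800; // current in-flight cadence, same gating

// Trust SSE when it is open; fall back to fast polling only when it is not.
function pollIntervalMs(ctx: PollContext): number {
  if (ctx.stream === "open") {
    return SAFETY_NET_MS;
  }
  return ctx.turnInFlight ? ACTIVE_FALLBACK_MS : FALLBACK_MS;
}

console.log(pollIntervalMs({ stream: "open", turnInFlight: true })); // 30000
console.log(pollIntervalMs({ stream: "closed", turnInFlight: true })); // 800
console.log(pollIntervalMs({ stream: "connecting", turnInFlight: false })); // 3000
```

The returned value could then feed each query's `refetchInterval`. Treating "connecting" as unhealthy is a design choice here: it keeps the UI fresh during reconnects, at the cost of some extra polling.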

**Structural:**
4. Serve a production build over the relay — a path where the web child runs `next build` once then `next start` (vs `next dev` at process.ts:360), at least when a relay/tunnel is active. Removes compile-on-demand, the 3.4 s stalls, the 179 MB unminified output, and gives cacheable hashed assets.
5. Enable compression at the relay edge (Caddy `encode zstd gzip` on the wildcard vhost; frp does not compress by default).
6. Investigate the 19 dev restarts (autostart-reconcile / autostart-refresh / self-update) so a relay session isn't triggering web restarts that each cost a cold recompile.
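Fix 4 could be sketched as a small planner that decides which commands the web child runs. This assumes the web package defines `build` and `start` scripts that wrap `next build` and `next start`, which this issue does not confirm; `WebChildOptions`, `WebChildPlan`, and `planWebChild` are hypothetical names, and the `relayActive` flag stands in for however the CLI detects an attached relay.

```typescript
interface WebChildOptions {
  port: number;
  relayActive: boolean; // assumption: the CLI knows whether a relay/tunnel is attached
}

interface WebChildPlan {
  prepare: string[][]; // argument lists to run to completion before serving
  serve: string[]; // argument list for the long-lived web child
}

function planWebChild(opts: WebChildOptions): WebChildPlan {
  const listen = ["-H", "127.0.0.1", "-p", String(opts.port)];
  if (opts.relayActive) {
    // Production path: build once, then serve the built output.
    return {
      prepare: [["run", "build"]],
      serve: ["run", "start", "--", ...listen],
    };
  }
  // Local path: unchanged dev server, matching the args quoted from process.ts:360.
  return { prepare: [], serve: ["run", "dev", "--", ...listen] };
}

console.log(planWebChild({ port: 3000, relayActive: true }));
console.log(planWebChild({ port: 3000, relayActive: false }));
```

Keeping the dev path byte-for-byte identical to today's arguments means local development is unaffected, and the build step only runs when a relay session needs it.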

## Logs

```
# default instance web.log — API request volume (one log)
43124 GET /api/runtime/chat
22837 GET /api/runtime/jobs
21844 GET /api/runtime/threads
17060 GET /api/runtime/__healthz
   84 GET /api/runtime/events/stream      # SSE reopens
   76 GET /api/runtime/chat/<id>/stream

# compile / restart churn
✓ Compiled in 3.4s          (web.log:184074)
19 × "✓ Ready in …"          (dev-server restarts)

# relay edge vs local round-trip
gini-relay.lilaclabs.ai/  -> TTFB 98.3 ms (dns 18.2 / connect 41.8 / tls 73.8)
localhost API             -> 7–14 ms
```

