Relay UI lag: dev-mode Next.js + redundant polling + recompile churn over the frp tunnel #323

Description

@PotatoParser

What happened

The web UI is heavily laggy over the gini-relay tunnel, and Next.js constantly shows "rendering" / "compiling". Tracing the default instance (the one the relay is attached to) found four compounding causes, all amplified by the relay round-trip.

The relay is an frp reverse tunnel: browser → Caddy TLS at gini-relay.lilaclabs.ai → frps → frpc on the host → gateway → Next.js child (evidence: node_modules/gini-relay/frps.toml:6, Caddyfile). Measured edge round-trip is 98.3 ms TTFB (DNS 18.2, connect 41.8, TLS 73.8) versus 7–14 ms for the same API served on localhost (web.log per-request timings). So every request that is instant locally costs at least 98.3 ms over the relay, and the UI fires hundreds of them.

Root cause #1 — the relay serves Next.js in DEV mode. src/cli/process.ts:360 spawns the web child as ["run", "dev", "--", "-H", "127.0.0.1", "-p", String(port)] (next dev). Consequences: compile-on-demand (web.log shows compiles from 32 ms up to 3.4 s at ✓ Compiled in 3.4s, web.log:184074); a 179 MB unminified dev build dir (web/.next-<instance>) served as uncached per-module chunks the browser refetches across the tunnel; plus an HMR websocket held open through frp.

Root cause #2 — redundant fast polling, each tick a relay round-trip. web/src/components/RuntimeStreamBridge.tsx:72-73 already invalidates react-query keys on every SSE event, and the code itself says per-query refetchInterval is therefore "only … a slow safety net." But several high-traffic queries still poll every 3000 ms (one at 800 ms). Request counts from one web.log:

| Endpoint | Requests | Driver |
|---|---|---|
| /api/runtime/chat | 43124 | useChatSessions 3000 ms (queries.ts:178) + agent-chat 3000 ms (620) |
| /api/runtime/jobs | 22837 | useAllJobs 3000 ms (queries.ts:167) |
| /api/runtime/threads | 21844 | threads 3000 ms (655) + threads-inbox 3000 ms (745) |
| /api/runtime/__healthz | 17060 | UpdateGate / health probe |
| /api/runtime/chat/<id>/threads | 8952 | per-session threads 3000 ms |
| /api/runtime/agents/<id>/chat | 8947 | active agent chat 3000 ms |

On top of that: per-task polling (thousands of GET /api/runtime/tasks/task_<id> rows, each task polled 5–12×), the active-chat poll, which drops to 800 ms while in flight (queries.ts:336), and the browser status poll at 1000 ms (queries.ts:825). On the chat page the 3000 ms pollers alone fire 5 requests every 3 s, i.e. 100 requests/min, each costing at least 98.3 ms over the relay whether or not anything changed. providers.tsx:57 also sets refetchOnWindowFocus: true, so every tab refocus refetches every mounted query at once.
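The rate arithmetic can be checked with a small sketch. The helper name and the interval list are illustrative assumptions, not code from the repo; the five 3000 ms pollers and the 98.3 ms TTFB are the figures quoted in this issue.

```typescript
// Hypothetical helper: steady-state requests per minute for a set of pollers,
// given each poller's refetch interval in milliseconds.
function requestsPerMinute(intervalsMs: number[]): number {
  return intervalsMs.reduce((sum, ms) => sum + 60000 / ms, 0);
}

// The five 3000 ms pollers cited for the chat page.
const chatPagePollers = [3000, 3000, 3000, 3000, 3000];
const perMinute = requestsPerMinute(chatPagePollers); // 5 * 20 = 100

// Summed round-trip cost per minute at the measured relay TTFB
// (requests overlap in practice, so this is total latency paid, not wall-clock time).
const relayTtfbMs = 98.3;
const summedRoundTripMs = perMinute * relayTtfbMs; // about 9830 ms

// Effect of the proposed safety-net cadences on the same five pollers.
const at30s = requestsPerMinute([30000, 30000, 30000, 30000, 30000]); // 10
const at60s = requestsPerMinute([60000, 60000, 60000, 60000, 60000]); // 5
```

At 30000 ms the same five queries drop from 100 to 10 requests/min, and at 60000 ms to 5, which matches the 10× and 20× cuts proposed below.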

Root cause #3 — frequent web-dev restarts, each a cold recompile. web.log contains 19 ✓ Ready in lines (dev-server restarts) and 6 ✓ Compiled lines. Each restart discards the compiled-route cache, so the next navigation recompiles (the 3.4 s compile lands right after a restart cluster, ✓ Ready in 830ms/759ms at 184287/184567). Likely triggers: autostart reconcile/refresh / self-update (src/runtime/autostart-reconcile.ts, autostart-refresh.ts); issue #260 already traced one "mystery SIGTERM" to autostart reconcile.

Root cause #4 — SSE reconnect churn over the tunnel keeps the pollers primary. The runtime stream is long-lived SSE (useRuntimeStream.ts, /api/runtime/events/stream; per-session /api/runtime/chat/<id>/stream). It needed resilient-event-source.ts because "a gateway restart turns the BFF's stream route into a 503, which permanently CLOSES a bare EventSource." web.log shows /api/runtime/events/stream reopened 84× and the main session's stream 76×. Long-lived SSE is the most fragile connection over frp, and on every drop the #2 polling becomes the primary freshness path, which is why those safety nets fire tens of thousands of times.

What you expected

Over the relay the UI should feel responsive: assets cached and minified, no recompile stalls, and request volume bounded by actual activity (driven by SSE) rather than a constant 100+ requests/min of redundant polling each paying the tunnel round-trip.

Reproduction

  1. Attach an instance to the gini-relay tunnel and open the chat UI through the *.gini-relay.lilaclabs.ai URL.
  2. Navigate between routes / agents and send a chat turn; observe "compiling" stalls and lag on each interaction.
  3. Inspect ~/.gini/instances/<instance>/logs/web.log: count API polls (rg -o "GET /api/runtime/[^ ?]+" web.log | sort | uniq -c | sort -rn) and compile/restart lines (rg -c "Ready in", rg -n "Compiled in").
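For anyone without rg at hand, the endpoint tally in step 3 can be approximated with a short script. This is a sketch only: the function name is made up, and the sample log lines in the usage note are invented for illustration rather than copied from web.log.

```typescript
// Tally "GET /api/runtime/<path>" occurrences per endpoint, most frequent first.
// Mirrors the rg pattern: the path runs up to the first space or '?'.
// '\n' is excluded explicitly because, unlike rg, the regex sees the whole text at once.
function tallyEndpoints(logText: string): Array<[string, number]> {
  const counts: Record<string, number> = {};
  const pattern = /GET \/api\/runtime\/[^ ?\n]+/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(logText)) !== null) {
    counts[match[0]] = (counts[match[0]] || 0) + 1;
  }
  return Object.keys(counts)
    .map((key): [string, number] => [key, counts[key]])
    .sort((a, b) => b[1] - a[1]);
}

// Usage with made-up sample lines (format assumed, not taken from the real log).
const sampleLog = [
  "GET /api/runtime/chat 200 in 9ms",
  "GET /api/runtime/jobs?limit=50 200 in 7ms",
  "GET /api/runtime/chat 200 in 8ms",
].join("\n");
const tally = tallyEndpoints(sampleLog);
```

Like the rg command, this does not collapse per-id paths such as /api/runtime/tasks/task_<id>; each distinct path is counted separately.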

Environment

  • OS: macOS 26.3 (build 25D125)
  • Bun version (bun --version): 1.3.14
  • Gini install method: git clone
  • Instance name: default
  • Provider (codex / openai / openrouter / local / echo): codex (gpt-5.5)

Proposed fixes

Quick wins (config / intervals, low-risk, reversible):

  1. Trust SSE — raise the 3000 ms refetchIntervals (queries.ts:167, 178, 620, 655, 745) to a true safety-net cadence (30000–60000 ms; 60000/3000 = 20× traffic cut, 30000/3000 = 10×), and gate the 800 ms active poll (queries.ts:336) behind "SSE unhealthy."
  2. Disable refetchOnWindowFocus (providers.tsx:57) or scope it to the few queries that need it.
  3. Slow the __healthz poll (17060 hits) to a heartbeat cadence.
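One possible shape for quick win 1, as a sketch under assumptions: the function and constant names are hypothetical, and how "SSE healthy" is determined is left open (it would have to come from whatever tracks the runtime stream state). The return type matches what react-query's refetchInterval option accepts, a number of milliseconds or false.

```typescript
// Value suitable for react-query's `refetchInterval` option.
type RefetchInterval = number | false;

const SAFETY_NET_MS = 30000; // low end of the 30000–60000 ms range proposed above
const FAST_POLL_MS = 800;    // the existing fast cadence (queries.ts:336)

// Hypothetical selector: poll fast only when SSE cannot be trusted AND
// something is in flight; otherwise fall back to the slow safety net.
function activeChatInterval(sseHealthy: boolean, inFlight: boolean): RefetchInterval {
  if (sseHealthy) {
    return SAFETY_NET_MS; // SSE invalidation is the primary freshness path
  }
  return inFlight ? FAST_POLL_MS : SAFETY_NET_MS;
}
```

With this shape the 800 ms poll only runs while the stream is down, so a healthy relay session pays for the safety-net cadence alone.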

Structural:

  4. Serve a production build over the relay — a path where the web child runs next build once then next start (vs next dev at process.ts:360), at least when a relay/tunnel is active. Removes compile-on-demand, the 3.4 s stalls, the 179 MB unminified output, and gives cacheable hashed assets.
  5. Enable compression at the relay edge (Caddy encode zstd gzip on the wildcard vhost; frp does not compress by default).
  6. Investigate the 19 dev restarts (autostart-reconcile / autostart-refresh / self-update) so a relay session isn't triggering web restarts that each cost a cold recompile.
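A minimal sketch of how fix 4 could pick the child command. The dev-mode argument list is the one quoted from src/cli/process.ts:360; everything else is an assumption: the function name, the relayActive flag, and the existence of a "start" script in the web package that maps to next start (with next build already run beforehand).

```typescript
// Hypothetical: choose the web child's script based on whether a relay is attached.
// Assumes the web package defines a "start" script (next start) alongside "dev",
// and that a production build exists before "start" is used.
function webChildArgs(port: number, relayActive: boolean): string[] {
  const script = relayActive ? "start" : "dev";
  return ["run", script, "--", "-H", "127.0.0.1", "-p", String(port)];
}

// Example: dev mode locally, production server over the relay.
const localArgs = webChildArgs(3000, false);
const relayArgs = webChildArgs(3000, true);
```

Keeping the host and port flags identical in both modes means the gateway's proxy target would not need to change; only the script name differs.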

Logs

# default instance web.log — API request volume (one log)
43124 GET /api/runtime/chat
22837 GET /api/runtime/jobs
21844 GET /api/runtime/threads
17060 GET /api/runtime/__healthz
   84 GET /api/runtime/events/stream      # SSE reopens
   76 GET /api/runtime/chat/<id>/stream

# compile / restart churn
✓ Compiled in 3.4s          (web.log:184074)
19 × "✓ Ready in …"          (dev-server restarts)

# relay edge vs local round-trip
gini-relay.lilaclabs.ai/  -> TTFB 98.3 ms (dns 18.2 / connect 41.8 / tls 73.8)
localhost API             -> 7–14 ms
