Bound CDP macrotask drains so commands aren't queued behind page work #2405

Open
navidemad wants to merge 1 commit into lightpanda-io:main from navidemad:worktree-fix-2402-cdp-macrotask-budget
@navidemad (Contributor) commented May 9, 2026

What this fixes

On pages with sustained JS activity (Angular SPAs in change-detection / requestAnimationFrame chains), every session-scoped CDP command — Runtime.evaluate, DOM.getDocument, DOM.getOuterHTML — stalls for 14–20 seconds in serve mode. fetch --dump html of the same page works fine because --wait-ms caps it; CDP has no equivalent ceiling.

The cleanest signal in the issue's reproducer is that Runtime.evaluate('1+1') — a zero-cost roundtrip — takes 14.7 seconds. Whatever the command is, it isn't slow because of what it does; it's slow because it can't be read off the WebSocket. See #2402 for the full trace.

Root cause

Runner._tick (CDP mode, .html / .complete branch) calls browser.runMacrotasks() before yielding to socket I/O via http_client.tick(...). Browser.runMacrotasks drains three loops back-to-back, none of which yield:

  1. env.runMacrotasks() → Scheduler.runQueue's while (queue.peek()): every ready user-scheduled task (timers, setTimeout, requestAnimationFrame, …)
  2. env.pumpMessageLoop() → while (v8__Platform__PumpMessageLoop(...)) {}: every V8 platform task
  3. env.runMicrotasks(): every queued microtask

On the OPSWAT page this drain runs for 14–20 s before returning. CDP commands sent during that window sit in the kernel WebSocket buffer the whole time. Once the drain finishes, the very next http_client.tick(0) reads them and they execute in ~0 ms — confirming the bottleneck is the drain, not the command.
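The ordering problem can be modelled in a few lines. This is only an illustrative sketch, not code from the PR: `Tick` and `firstReadAtMs` are hypothetical names, and each tick is reduced to "drain for some time, then poll the socket once".

```zig
const std = @import("std");

// Hypothetical model of one Runner tick: drain page work, then poll the socket.
const Tick = struct { drain_ms: u64 };

/// Returns the time (ms since the first tick started) at which a command that
/// reached the socket at `arrival_ms` is first read, or null if no tick reads it.
fn firstReadAtMs(ticks: []const Tick, arrival_ms: u64) ?u64 {
    var now_ms: u64 = 0;
    for (ticks) |tick| {
        now_ms += tick.drain_ms; // the drain runs to completion, no socket I/O
        if (arrival_ms <= now_ms) return now_ms; // the poll only happens here
    }
    return null;
}

test "a command waits for the whole drain" {
    // One 14.7 s drain: a command that arrived after 100 ms is read at 14.7 s.
    const ticks = [_]Tick{.{ .drain_ms = 14_700 }};
    try std.testing.expectEqual(@as(?u64, 14_700), firstReadAtMs(&ticks, 100));
}
```

In this model the cost of the command never appears: the read time depends only on when the drain returns.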

What this PR changes

Threads an optional monotonic-clock deadline through the drain chain. Inner loops check it between tasks and yield to the caller when it has elapsed, leaving still-ready tasks queued for the next pass.

The hot edit is in Runner.zig:

// In CDP mode, bound the drain so a long Angular-style macrotask
// chain doesn't block us from polling the WebSocket below. In
// non-CDP mode (fetch), let the drain run to completion — there
// is no socket to service and `wait_ms` already caps wall time.
const macrotask_deadline: ?u64 = if (comptime is_cdp)
    milliTimestamp(.monotonic) + CDP_MACROTASK_BUDGET_MS
else
    null;

try browser.runMacrotasks(macrotask_deadline);

The deadline is plumbed through Browser.runMacrotasks → Env.runMacrotasks / Env.pumpMessageLoop → Scheduler.run / Scheduler.runQueue as deadline_ms: ?u64. All five gain the parameter. null preserves the existing unbounded behavior, which is what every non-CDP caller passes (worker Local.runMacrotasks, Context.deinit, the direct scheduler.run in ScriptManagerBase, and Runner._tick in fetch mode).
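A minimal sketch of what a deadline-aware drain loop could look like, assuming the deadline is checked after each task as the commit message describes. It is not the PR's Scheduler code: `FakeClock`, `Task`, and this standalone `runQueue` are hypothetical stand-ins, and each task simply advances a fake clock by its cost.

```zig
const std = @import("std");

// Hypothetical stand-in for the monotonic clock, advanced by task costs.
const FakeClock = struct { now_ms: u64 = 0 };

// Hypothetical task: only its running time matters for this sketch.
const Task = struct { cost_ms: u64 };

/// Runs tasks in order and returns how many were executed. With a non-null
/// deadline, the loop yields once the clock reaches it; the tasks that were
/// not run stay with the caller for the next pass.
fn runQueue(clock: *FakeClock, queue: []const Task, deadline_ms: ?u64) usize {
    var ran: usize = 0;
    for (queue) |task| {
        clock.now_ms += task.cost_ms; // "execute" the task
        ran += 1;
        if (deadline_ms) |deadline| {
            if (clock.now_ms >= deadline) break; // budget spent, yield
        }
    }
    return ran;
}

test "null drains everything, an elapsed deadline runs exactly one task" {
    const queue = [_]Task{.{ .cost_ms = 30 }} ** 50;

    var unbounded = FakeClock{};
    try std.testing.expectEqual(@as(usize, 50), runQueue(&unbounded, &queue, null));

    var bounded = FakeClock{};
    try std.testing.expectEqual(@as(usize, 1), runQueue(&bounded, &queue, 1));
}
```

Checking after the task rather than before it means every pass makes progress, even when the deadline has already elapsed on entry.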

Why 50 ms

It's a trade-off between page progress and CDP responsiveness. The deadline is checked between tasks, so:

  • Smaller (e.g. 5 ms) → finer-grained yielding, more time spent re-entering the tick loop than running JS.
  • Larger (e.g. 500 ms) → coarser yielding; on a page where individual callbacks are short, you'd see CDP commands wait that long behind the next batch.
  • 50 ms → reasonable middle ground. On a page with short callbacks, CDP commands are picked up within ~50 ms. On a page with long single callbacks (like OPSWAT, see verification below), the per-command floor is set by the longest individual callback, not the budget; sub-second response on those pages needs ask #3 from #2402 (V8 RequestInterrupt).

Hard-coded as CDP_MACROTASK_BUDGET_MS in Runner.zig. Could be exposed as a serve flag later — the issue reporter is comfortable with anything well under 1 s, so a follow-up is fine.
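The trade-off above can be put into numbers with a small worked example. This is a sketch under simplifying assumptions: `drainYieldsAfterMs` is a hypothetical helper, callbacks are all the same length, there are always more ready callbacks, and the deadline is checked after each one.

```zig
const std = @import("std");

/// Hypothetical helper: with uniform callbacks of `callback_ms` each and the
/// deadline checked after every callback, returns how long one drain runs
/// before it yields to the socket poll.
fn drainYieldsAfterMs(callback_ms: u64, budget_ms: u64) u64 {
    std.debug.assert(callback_ms > 0);
    // Smallest callback count whose total reaches the budget, and at least one.
    const needed = (budget_ms + callback_ms - 1) / callback_ms;
    const ran = @max(@as(u64, 1), needed);
    return ran * callback_ms;
}

test "budget versus callback length" {
    // Short callbacks: the budget sets the wait.
    try std.testing.expectEqual(@as(u64, 50), drainYieldsAfterMs(1, 50));
    try std.testing.expectEqual(@as(u64, 500), drainYieldsAfterMs(1, 500));
    // One long callback: the callback sets the wait, whatever the budget.
    try std.testing.expectEqual(@as(u64, 2750), drainYieldsAfterMs(2750, 50));
}
```

The last case mirrors the ~2.75 s floor reported for the OPSWAT page: a smaller budget would not reduce it, because the check cannot run in the middle of a callback.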

Caveats

Where to focus review

The new parameter on Browser.runMacrotasks / Env.runMacrotasks / Scheduler.run / Scheduler.runQueue is mechanical — each had exactly one caller before this PR.

Env.pumpMessageLoop had three callers, and the two non-Browser ones run in different lifecycles. They both pass null (i.e. no behavior change), but I'd appreciate a second read on:

  • src/browser/js/Local.zig:111 — worker context, called from ScriptManagerBase, WorkerGlobalScope, Worker
  • src/browser/js/Context.zig:229 — context deinit, drains residual platform tasks before MicrotaskQueue deletion

Both should be unchanged. But worker / shutdown paths are exactly the kind of edge case where it would be easy to miss something.

Test plan

  • make test: 523/523 pass, including two new Scheduler.run unit tests:
      • Scheduler.run(null) drains all 50 queued tasks
      • Scheduler.run(1) (deadline already in the past) runs exactly 1 task, leaves the rest queued; a follow-up Scheduler.run(null) drains them
  • Manual verification against the OPSWAT reproducer in #2402: every session-scoped command moves from "TIMEOUT 20 s" (current main) to ~2.75 s (this PR). See the verification comment below for the side-by-side probe output.

Refs #2402

Pages with sustained JS activity (Angular RAF chains, change detection on
heavy SPAs) hold the V8 thread inside `Browser.runMacrotasks` for many
seconds at a time. Because `Runner._tick` only polls the WebSocket *after*
the drain returns, queued CDP commands sit in the kernel buffer for the
duration — every session-scoped command stalled 14–20s on the reproducer
in lightpanda-io#2402, even `Runtime.evaluate('1+1')`.

Thread an optional monotonic-clock deadline through `Browser.runMacrotasks`
→ `Env.runMacrotasks` / `Env.pumpMessageLoop` → `Scheduler.run` /
`Scheduler.runQueue`. Inner loops check the deadline after each task and
yield back to the caller when it elapses, leaving still-ready tasks in the
queue. `Runner._tick` sets a 50ms deadline in CDP mode and `null`
(unbounded) in fetch mode, preserving existing fetch behavior.

Other `pumpMessageLoop` callers (worker context, Context.deinit) and the
single direct `scheduler.run` call in ScriptManagerBase pass `null`.

Adds two unit tests on `Scheduler.run` covering the no-deadline drain and
the elapsed-deadline yield behavior.

Refs lightpanda-io#2402
@navidemad navidemad marked this pull request as ready for review May 9, 2026 13:39
@navidemad (Contributor, Author)

Verified against the issue's cdp-probe.mjs reproducer on the OPSWAT Angular SPA (https://www.opswat.com/docs/mdmft/metadefender-mft), macOS / Darwin 25.4.0:

| Command | Unpatched (6e9156a8) | Patched (2bdc0ae1) |
| --- | --- | --- |
| Runtime.evaluate document.documentElement.outerHTML | TIMEOUT 20 s | OK in 3352 ms |
| Runtime.evaluate document.documentElement (returnByValue:false) | TIMEOUT 20 s | OK in 2776 ms |
| DOM.getDocument {depth:0} | TIMEOUT 20 s | OK in 2773 ms |
| DOM.getDocument {} (default depth=3) | TIMEOUT 20 s | OK in 2791 ms |
| Runtime.evaluate('1+1') (sanity) | TIMEOUT 20 s | OK in 2750 ms |
| DOM.getDocument retry {depth:0} | TIMEOUT 20 s | OK in 2744 ms |
| Target.closeTarget | TIMEOUT 5 s | OK in 2744 ms |

Every session-scoped command moves from "never returns" to ~2.75 s. The per-command latency floor matches the duration of a single Angular change-detection callback on this page — exactly the caveat about long single tasks I called out in the body. Sub-second response on pages like this needs V8 RequestInterrupt (issue #2402 ask #3, separate change).

Full probe output

Unpatched:

[+0ms]      connected
[+17ms]     Target.createTarget: OK
[+19ms]     Target.attachToTarget: OK
[+20ms]     Network.enable: OK
[+20ms]     Page.enable: OK
[+20ms]     Emulation.setUserAgentOverride: OK
[+282ms]    Page.navigate: OK
[+282ms]    sleeping 8s post-navigate
[+28284ms]  Runtime.evaluate document.documentElement.outerHTML (returnByValue:true): FAIL in 20000ms — TIMEOUT
[+48286ms]  Runtime.evaluate document.documentElement (returnByValue:false):          FAIL in 20002ms — TIMEOUT
[+68286ms]  DOM.getDocument {depth:0}:                                                FAIL in 20000ms — TIMEOUT
[+88287ms]  DOM.getDocument {} (default depth=3):                                     FAIL in 20001ms — TIMEOUT
[+108288ms] Runtime.evaluate 1+1 (sanity):                                            FAIL in 20000ms — TIMEOUT
[+128289ms] DOM.getDocument retry {depth:0}:                                          FAIL in 20001ms — TIMEOUT
[+133290ms] Target.closeTarget:                                                       FAIL in 5001ms — TIMEOUT

Patched:

[+0ms]      connected
[+2ms]      Target.createTarget: OK
[+3ms]      Target.attachToTarget: OK
[+3ms]      Network.enable: OK
[+3ms]      Page.enable: OK
[+4ms]      Emulation.setUserAgentOverride: OK
[+106ms]    Page.navigate: OK
[+106ms]    sleeping 8s post-navigate
[+11459ms]  Runtime.evaluate document.documentElement.outerHTML (returnByValue:true): OK in 3352ms
[+14236ms]  Runtime.evaluate document.documentElement (returnByValue:false):          OK in 2776ms
[+17009ms]  DOM.getDocument {depth:0}:                                                OK in 2773ms
[+19800ms]  DOM.getDocument {} (default depth=3):                                     OK in 2791ms
[+22550ms]  Runtime.evaluate 1+1 (sanity):                                            OK in 2750ms
[+25294ms]  DOM.getDocument retry {depth:0}:                                          OK in 2744ms
[+28038ms]  Target.closeTarget:                                                       OK in 2744ms

@karlseguin (Collaborator)

I think #2393 does the job, while being much simpler.
