Skip to content

diag: fault-detect fixes + tracing (tokenize offload, ACK/stall/prefill traces)#1

Draft
nv-yna wants to merge 7 commits into
feat/deepseek_v4_aafrom
yna/dsv4-faultdetect-fix
Draft

diag: fault-detect fixes + tracing (tokenize offload, ACK/stall/prefill traces)#1
nv-yna wants to merge 7 commits into
feat/deepseek_v4_aafrom
yna/dsv4-faultdetect-fix

Conversation

@nv-yna

@nv-yna nv-yna commented Jun 29, 2026

Copy link
Copy Markdown
Owner

Diagnostic + fix branch for the DeepSeek-V4-Pro disagg perf investigation, on top of feat/deepseek_v4_aa
(base = 5142978). Opened for review of the changes; the instrument commits are env-gated (off by default).

Commits (7)

Fixes

  • bb6b5cac29 fix(frontend): offload prompt tokenization off the async event loop — removes an earlier hard
    throughput collapse where tokenization blocked the frontend event loop under the c920 closed-loop load.
  • bab290059a fix(runtime): hysteresis before inhibiting a worker on transient request-plane faults.

Diagnostics (all env-gated, zero-cost when off)

  • 096c9a6470d / 097c9d85bc instrument: request-plane ACK decode→flush timing + per-runtime event-loop stall
    / request arrival-delay logging (DYN_ACK_TRACE).
  • ddb7ba24e8 / 0b4ca48ee3 instrument: per-op busy-time attribution for frontend event-loop stalls + split
    tokenize-offload latency into pool-wait vs encode (DYN_STALL_OP_TRACE).
  • 70ae211745 instrument: gated prefill lifecycle trace (a→e) + per-CTX in-flight gauge (DYN_PREFILL_TRACE).

Scope

17 files, ~+826 lines (mostly the gated trace plumbing in lib/llm/src/kv_router/, lib/runtime/,
components/src/dynamo/trtllm/). Diff is exactly these 7 commits vs feat/deepseek_v4_aa.

This is the image code behind the DSV4 disagg root-cause work (see internal ROOT_CAUSE notes). Not necessarily
for merge as-is — the diagnostics may be trimmed before any upstream submission.

nv-yna added 7 commits June 28, 2026 03:04
…est-plane faults

Add a per-instance consecutive-failure threshold (env DYN_FAULT_INHIBIT_THRESHOLD,
default 1 = legacy: inhibit on first failure) gating report_instance_down, with a
reset on successful dispatch. A single ~5s request-plane ACK timeout under decode
saturation no longer removes a live worker (which then flapped back every 5s via
reconcile). All is_inhibited fault sites in push_router route through this path.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Add per-request ACK latency tracing on the worker request plane to localize the
iter4 collapse trigger (GEN request-plane ACK starvation). Carries decoded_at +
request_id from read_loop to write_loop so the flush latency splits into
queue_ms (write-task scheduling delay = runtime/CPU starvation) vs write_ms
(socket), logged to target dynamo_ack_trace, WARN above DYN_ACK_TRACE_WARN_MS.
Frontend CannotConnect timeout log gains request_id+timeout_s for correlation.
Gated by DYN_ACK_TRACE=1; zero-overhead when off.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
… (DYN_ACK_TRACE)

Add direct logging to identify WHICH runtime is starved during the request-plane
ACK-timeout collapse, instead of relying on the frontend-only metric scrape:
- tokio_metrics_and_canary_loop now WARNs to target dynamo_stall when its event
  loop is delayed past DYN_STALL_LOG_MS (default 250ms).
- spawn that canary from SharedTcpServer.bind_and_start (gated by DYN_ACK_TRACE)
  so WORKER runtimes are monitored, not just the frontend HTTP service.
- read_loop WARNs the frontend-send -> worker-decode arrival delay (the pre-decode
  wait the decode->flush trace misses) to target dynamo_ack_trace.
Together: each process's log shows if/when its request-plane runtime starves.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
ROOT CAUSE of the DSV4 disagg SLO collapse: the frontend generate path tokenized
the ~40k-token prompt SYNCHRONOUSLY on the async event loop (gather_tokens ->
encode_with_timing -> tokenizer.encode), unlike the embedding path which already
offloads via spawn_blocking. At high concurrency this stalls the frontend tokio
runtime multi-second (measured: 135 event-loop stalls, max 2.66s), starving the
request-plane I/O -> 5s ACK timeout -> CannotConnect -> GEN worker inhibited ->
flap/cascade -> throughput collapse (105 -> 14-30 req/s), while GEN/CTX engines
stay healthy. Evidence: iter4 C8diag2 stall logs + send->decode 14.6s arrival.

Fix: make encode_with_timing async and run tokenizer.encode on the bounded
blocking pool (spawn_blocking), mirroring the embedding path; propagate async
through gather_tokens and its two callers. Frees the event loop so request-plane
I/O is polled -> no false CannotConnect -> no cascade.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…ls (DYN_STALL_OP_TRACE)

The stall canary proves THAT the frontend event loop stalls but not WHICH op. Add
env-gated (DYN_STALL_OP_TRACE=1, DYN_STALL_OP_WARN_MS default 50) per-op BUSY-time
WARN to target dynamo_stall_op so a residual frontend stall (after the tokenization
offload) is attributed to a NAMED synchronous request-path op rather than inferred:
- kv_router.rs find_best_match_details: reuse already-computed deltas (hash_elapsed,
  seq_hash_elapsed-hash_elapsed = block/seq-hash BUSY [no await]; indexer_duration;
  total-find_matches = schedule, wall upper-bound).
- queue.rs admit_one: project_worker_loads + select_worker BUSY on the scheduler-actor
  task (op=admit_select).
Busy-not-wall (brackets contain no .await), correlates by timestamp with dynamo_stall.
Zero-cost when off. Per the subagent analysis: the two likely residual culprits are the
inline indexer radix walk and the serialized scheduler admission.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…DYN_STALL_OP_TRACE)

To resolve codex (tokenize spawn_blocking pool-queue) vs per-log workflow (router
admission-queue) for the receive->prefill-dispatch TTFT gap: time the encode INSIDE
the spawn_blocking task so total offload latency splits into pool_wait_ms (submit->run,
the bounded-blocking-pool queue) + encode_ms (run->done). WARN to dynamo_stall_op
op=tokenize when total>=DYN_STALL_OP_WARN_MS. Paired with the existing op=schedule
(router-queue/scheduler wall) and op=admit_select, one run now splits the 6.3s gap into
tokenize-pool-wait vs router-queue cleanly. Zero-cost when off.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…DYN_PREFILL_TRACE)

Localize disagg prefill pre-decode time: per-request a(arrive)->b(CTX-selected)->
c(dispatch)->d(first-resp)->e(done) gaps + per-CTX in-flight-prefill gauge.
Discriminator for CTX under-feed root cause. Zero-cost when off.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant