diag: fault-detect fixes + tracing (tokenize offload, ACK/stall/prefill traces) by nv-yna · Pull Request #1 · nv-yna/dynamo

nv-yna · 2026-06-29T11:03:36Z

Diagnostic + fix branch for the DeepSeek-V4-Pro disagg perf investigation, on top of feat/deepseek_v4_aa
(base = 5142978). Opened for review of the changes; the instrument commits are env-gated (off by default).

Commits (7)

Fixes

bb6b5cac29 fix(frontend): offload prompt tokenization off the async event loop — removes an earlier hard
throughput collapse where tokenization blocked the frontend event loop under the c920 closed-loop load.
bab290059a fix(runtime): hysteresis before inhibiting a worker on transient request-plane faults.

Diagnostics (all env-gated, zero-cost when off)

096c9a6470d / 097c9d85bc instrument: request-plane ACK decode→flush timing + per-runtime event-loop stall
/ request arrival-delay logging (DYN_ACK_TRACE).
ddb7ba24e8 / 0b4ca48ee3 instrument: per-op busy-time attribution for frontend event-loop stalls + split
tokenize-offload latency into pool-wait vs encode (DYN_STALL_OP_TRACE).
70ae211745 instrument: gated prefill lifecycle trace (a→e) + per-CTX in-flight gauge (DYN_PREFILL_TRACE).

Scope

17 files, ~+826 lines (mostly the gated trace plumbing in lib/llm/src/kv_router/, lib/runtime/,
components/src/dynamo/trtllm/). Diff is exactly these 7 commits vs feat/deepseek_v4_aa.

This is the image code behind the DSV4 disagg root-cause work (see internal ROOT_CAUSE notes). Not necessarily
for merge as-is — the diagnostics may be trimmed before any upstream submission.

…est-plane faults Add a per-instance consecutive-failure threshold (env DYN_FAULT_INHIBIT_THRESHOLD, default 1 = legacy: inhibit on first failure) gating report_instance_down, with a reset on successful dispatch. A single ~5s request-plane ACK timeout under decode saturation no longer removes a live worker (which then flapped back every 5s via reconcile). All is_inhibited fault sites in push_router route through this path. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

Add per-request ACK latency tracing on the worker request plane to localize the iter4 collapse trigger (GEN request-plane ACK starvation). Carries decoded_at + request_id from read_loop to write_loop so the flush latency splits into queue_ms (write-task scheduling delay = runtime/CPU starvation) vs write_ms (socket), logged to target dynamo_ack_trace, WARN above DYN_ACK_TRACE_WARN_MS. Frontend CannotConnect timeout log gains request_id+timeout_s for correlation. Gated by DYN_ACK_TRACE=1; zero-overhead when off. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

… (DYN_ACK_TRACE) Add direct logging to identify WHICH runtime is starved during the request-plane ACK-timeout collapse, instead of relying on the frontend-only metric scrape: - tokio_metrics_and_canary_loop now WARNs to target dynamo_stall when its event loop is delayed past DYN_STALL_LOG_MS (default 250ms). - spawn that canary from SharedTcpServer.bind_and_start (gated by DYN_ACK_TRACE) so WORKER runtimes are monitored, not just the frontend HTTP service. - read_loop WARNs the frontend-send -> worker-decode arrival delay (the pre-decode wait the decode->flush trace misses) to target dynamo_ack_trace. Together: each process's log shows if/when its request-plane runtime starves. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

ROOT CAUSE of the DSV4 disagg SLO collapse: the frontend generate path tokenized the ~40k-token prompt SYNCHRONOUSLY on the async event loop (gather_tokens -> encode_with_timing -> tokenizer.encode), unlike the embedding path which already offloads via spawn_blocking. At high concurrency this stalls the frontend tokio runtime multi-second (measured: 135 event-loop stalls, max 2.66s), starving the request-plane I/O -> 5s ACK timeout -> CannotConnect -> GEN worker inhibited -> flap/cascade -> throughput collapse (105 -> 14-30 req/s), while GEN/CTX engines stay healthy. Evidence: iter4 C8diag2 stall logs + send->decode 14.6s arrival. Fix: make encode_with_timing async and run tokenizer.encode on the bounded blocking pool (spawn_blocking), mirroring the embedding path; propagate async through gather_tokens and its two callers. Frees the event loop so request-plane I/O is polled -> no false CannotConnect -> no cascade. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

…ls (DYN_STALL_OP_TRACE) The stall canary proves THAT the frontend event loop stalls but not WHICH op. Add env-gated (DYN_STALL_OP_TRACE=1, DYN_STALL_OP_WARN_MS default 50) per-op BUSY-time WARN to target dynamo_stall_op so a residual frontend stall (after the tokenization offload) is attributed to a NAMED synchronous request-path op rather than inferred: - kv_router.rs find_best_match_details: reuse already-computed deltas (hash_elapsed, seq_hash_elapsed-hash_elapsed = block/seq-hash BUSY [no await]; indexer_duration; total-find_matches = schedule, wall upper-bound). - queue.rs admit_one: project_worker_loads + select_worker BUSY on the scheduler-actor task (op=admit_select). Busy-not-wall (brackets contain no .await), correlates by timestamp with dynamo_stall. Zero-cost when off. Per the subagent analysis: the two likely residual culprits are the inline indexer radix walk and the serialized scheduler admission. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

…DYN_STALL_OP_TRACE) To resolve codex (tokenize spawn_blocking pool-queue) vs per-log workflow (router admission-queue) for the receive->prefill-dispatch TTFT gap: time the encode INSIDE the spawn_blocking task so total offload latency splits into pool_wait_ms (submit->run, the bounded-blocking-pool queue) + encode_ms (run->done). WARN to dynamo_stall_op op=tokenize when total>=DYN_STALL_OP_WARN_MS. Paired with the existing op=schedule (router-queue/scheduler wall) and op=admit_select, one run now splits the 6.3s gap into tokenize-pool-wait vs router-queue cleanly. Zero-cost when off. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

…DYN_PREFILL_TRACE) Localize disagg prefill pre-decode time: per-request a(arrive)->b(CTX-selected)-> c(dispatch)->d(first-resp)->e(done) gaps + per-CTX in-flight-prefill gauge. Discriminator for CTX under-feed root cause. Zero-cost when off. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>

nv-yna added 7 commits June 28, 2026 03:04

github-actions Bot added router frontend labels Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

diag: fault-detect fixes + tracing (tokenize offload, ACK/stall/prefill traces)#1

diag: fault-detect fixes + tracing (tokenize offload, ACK/stall/prefill traces)#1
nv-yna wants to merge 7 commits into
feat/deepseek_v4_aafrom
yna/dsv4-faultdetect-fix

nv-yna commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nv-yna commented Jun 29, 2026

Commits (7)

Scope

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant