diag: fault-detect fixes + tracing (tokenize offload, ACK/stall/prefill traces)#1
Draft
nv-yna wants to merge 7 commits into
…est-plane faults Add a per-instance consecutive-failure threshold (env DYN_FAULT_INHIBIT_THRESHOLD, default 1 = legacy: inhibit on first failure) gating report_instance_down, with a reset on successful dispatch. A single ~5s request-plane ACK timeout under decode saturation no longer removes a live worker (which then flapped back every 5s via reconcile). All is_inhibited fault sites in push_router route through this path. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
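The consecutive-failure gate described in this commit can be sketched roughly as below. This is a minimal illustration, not the PR's code: `FaultGate` and its methods are hypothetical names, and clamping a threshold of 0 up to 1 is an assumption. Only the env var name and its default of 1 come from the commit message.

```rust
use std::collections::HashMap;

/// Tracks consecutive request-plane failures per worker instance.
/// `FaultGate` and its method names are hypothetical, not the PR's types.
pub struct FaultGate {
    threshold: u32,
    consecutive: HashMap<u64, u32>,
}

impl FaultGate {
    /// A threshold of 1 reproduces the legacy behavior: inhibit on the first failure.
    /// Assumption: a threshold of 0 is clamped to 1.
    pub fn new(threshold: u32) -> Self {
        FaultGate {
            threshold: threshold.max(1),
            consecutive: HashMap::new(),
        }
    }

    /// Reads DYN_FAULT_INHIBIT_THRESHOLD; falls back to 1 when unset or unparsable.
    pub fn from_env() -> Self {
        let threshold = std::env::var("DYN_FAULT_INHIBIT_THRESHOLD")
            .ok()
            .and_then(|v| v.parse::<u32>().ok())
            .unwrap_or(1);
        Self::new(threshold)
    }

    /// Records one failure; returns true once the streak reaches the threshold,
    /// i.e. when the caller should report the instance down.
    pub fn record_failure(&mut self, instance_id: u64) -> bool {
        let count = self.consecutive.entry(instance_id).or_insert(0);
        *count = count.saturating_add(1);
        *count >= self.threshold
    }

    /// A successful dispatch clears the streak for that instance.
    pub fn record_success(&mut self, instance_id: u64) {
        self.consecutive.remove(&instance_id);
    }
}
```

With a threshold of 3, two ACK timeouts followed by one successful dispatch leave the worker in rotation; only three failures in a row would trip the gate.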
Add per-request ACK latency tracing on the worker request plane to localize the iter4 collapse trigger (GEN request-plane ACK starvation). Carries decoded_at + request_id from read_loop to write_loop so the flush latency splits into queue_ms (write-task scheduling delay = runtime/CPU starvation) vs write_ms (socket), logged to target dynamo_ack_trace, WARN above DYN_ACK_TRACE_WARN_MS. Frontend CannotConnect timeout log gains request_id+timeout_s for correlation. Gated by DYN_ACK_TRACE=1; zero-overhead when off. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
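The queue_ms / write_ms split can be illustrated with three timestamps per ACK. The sketch below is an assumption about shape only: `AckTiming`, its fields other than `decoded_at`, and `should_warn` are illustrative names, and warning on the sum of both parts is a guess at how the threshold is applied.

```rust
use std::time::Instant;

/// Timestamps carried with one ACK from the read side to the write side.
/// Type and field names are illustrative; only `decoded_at` is named in the commit.
pub struct AckTiming {
    pub decoded_at: Instant,    // request decoded in the read loop
    pub write_started: Instant, // write task picked the ACK up
    pub write_done: Instant,    // socket flush returned
}

impl AckTiming {
    /// Returns (queue_ms, write_ms): scheduling delay before the write task ran,
    /// then time spent in the socket write itself.
    pub fn split_ms(&self) -> (u128, u128) {
        let queue = self.write_started.saturating_duration_since(self.decoded_at);
        let write = self.write_done.saturating_duration_since(self.write_started);
        (queue.as_millis(), write.as_millis())
    }
}

/// Assumption: the WARN threshold applies to the total decode-to-flush latency.
pub fn should_warn(queue_ms: u128, write_ms: u128, warn_ms: u128) -> bool {
    queue_ms + write_ms > warn_ms
}
```

A large queue_ms with a small write_ms points at runtime or CPU starvation rather than the socket.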
… (DYN_ACK_TRACE) Add direct logging to identify WHICH runtime is starved during the request-plane ACK-timeout collapse, instead of relying on the frontend-only metric scrape:
- tokio_metrics_and_canary_loop now WARNs to target dynamo_stall when its event loop is delayed past DYN_STALL_LOG_MS (default 250ms).
- spawn that canary from SharedTcpServer.bind_and_start (gated by DYN_ACK_TRACE) so WORKER runtimes are monitored, not just the frontend HTTP service.
- read_loop WARNs the frontend-send -> worker-decode arrival delay (the pre-decode wait the decode->flush trace misses) to target dynamo_ack_trace.
Together: each process's log shows if/when its request-plane runtime starves. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
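The idea behind a stall canary is to sleep for a fixed interval and measure how late the wake-up is. The sketch below is a std-only stand-in: in the PR the canary runs as a task on the monitored tokio runtime, whereas here a plain thread sleep is used, and `tick_lag`, `stall_warning`, and `run_canary` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Lag of one canary tick: how much later than `interval` the tick actually ran.
pub fn tick_lag(interval: Duration, scheduled_at: Instant, woke_at: Instant) -> Duration {
    woke_at
        .saturating_duration_since(scheduled_at)
        .saturating_sub(interval)
}

/// Returns a log line when the lag exceeds the warn threshold
/// (the commit's default for DYN_STALL_LOG_MS is 250 ms).
pub fn stall_warning(lag: Duration, warn_after: Duration) -> Option<String> {
    if lag > warn_after {
        Some(format!("event loop stalled lag_ms={}", lag.as_millis()))
    } else {
        None
    }
}

/// Runs `ticks` canary iterations on the current thread and returns the worst lag.
/// Stand-in only: a real canary would await a runtime timer instead of thread::sleep.
pub fn run_canary(interval: Duration, ticks: u32) -> Duration {
    let mut worst = Duration::ZERO;
    for _ in 0..ticks {
        let scheduled_at = Instant::now();
        std::thread::sleep(interval);
        worst = worst.max(tick_lag(interval, scheduled_at, Instant::now()));
    }
    worst
}
```

Because the canary shares the runtime it monitors, any task that holds a worker thread without yielding shows up as lag in that process's own log.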
ROOT CAUSE of the DSV4 disagg SLO collapse: the frontend generate path tokenized the ~40k-token prompt SYNCHRONOUSLY on the async event loop (gather_tokens -> encode_with_timing -> tokenizer.encode), unlike the embedding path which already offloads via spawn_blocking. At high concurrency this stalls the frontend tokio runtime multi-second (measured: 135 event-loop stalls, max 2.66s), starving the request-plane I/O -> 5s ACK timeout -> CannotConnect -> GEN worker inhibited -> flap/cascade -> throughput collapse (105 -> 14-30 req/s), while GEN/CTX engines stay healthy. Evidence: iter4 C8diag2 stall logs + send->decode 14.6s arrival. Fix: make encode_with_timing async and run tokenizer.encode on the bounded blocking pool (spawn_blocking), mirroring the embedding path; propagate async through gather_tokens and its two callers. Frees the event loop so request-plane I/O is polled -> no false CannotConnect -> no cascade. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
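The fix moves the CPU-bound encode off the thread that polls I/O. The PR does this with tokio's `spawn_blocking`; the sketch below is only a std-only analogue using a plain thread and a channel, so it can run without external crates. The `encode` stand-in and `encode_offloaded` are hypothetical, and the polling loop merely represents "the event loop stays free".

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Stand-in for a CPU-heavy tokenizer: one fake token id per whitespace-separated word.
fn encode(prompt: &str) -> Vec<u32> {
    prompt.split_whitespace().map(|w| w.len() as u32).collect()
}

/// Runs `encode` on a separate thread. The caller's loop keeps ticking while it
/// waits; the returned count is how many times it was free to do other work.
pub fn encode_offloaded(prompt: String) -> (Vec<u32>, u64) {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        // Ignore the send error: it only occurs if the caller stopped waiting.
        let _ = tx.send(encode(&prompt));
    });

    let mut ticks = 0u64;
    let tokens = loop {
        match rx.try_recv() {
            Ok(tokens) => break tokens,
            Err(mpsc::TryRecvError::Empty) => {
                ticks += 1; // the loop could service request-plane I/O here
                thread::sleep(Duration::from_millis(1));
            }
            Err(mpsc::TryRecvError::Disconnected) => {
                panic!("encoder thread exited without a result")
            }
        }
    };
    worker.join().expect("encoder thread panicked");
    (tokens, ticks)
}
```

Calling `encode` inline in the loop instead would block every other piece of work on that thread for the full encode time, which is the failure mode the commit describes.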
…ls (DYN_STALL_OP_TRACE) The stall canary proves THAT the frontend event loop stalls but not WHICH op. Add env-gated (DYN_STALL_OP_TRACE=1, DYN_STALL_OP_WARN_MS default 50) per-op BUSY-time WARN to target dynamo_stall_op so a residual frontend stall (after the tokenization offload) is attributed to a NAMED synchronous request-path op rather than inferred:
- kv_router.rs find_best_match_details: reuse already-computed deltas (hash_elapsed, seq_hash_elapsed-hash_elapsed = block/seq-hash BUSY [no await]; indexer_duration; total-find_matches = schedule, wall upper-bound).
- queue.rs admit_one: project_worker_loads + select_worker BUSY on the scheduler-actor task (op=admit_select).
Busy-not-wall (brackets contain no .await), correlates by timestamp with dynamo_stall. Zero-cost when off. Per the subagent analysis: the two likely residual culprits are the inline indexer radix walk and the serialized scheduler admission. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
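Per-op busy-time attribution amounts to bracketing a synchronous section with a clock read and warning when it runs long. The helper below is a hedged sketch: `timed_op` is a hypothetical name, the log line format is invented, and the `enabled` flag stands in for however the PR caches the DYN_STALL_OP_TRACE check.

```rust
use std::time::Instant;

/// Times a synchronous section and returns a warning line when it ran for at
/// least `warn_ms`. The closure must not contain an await point, so the
/// measurement is busy time rather than wall time.
pub fn timed_op<T>(
    enabled: bool,
    op: &str,
    warn_ms: u128,
    f: impl FnOnce() -> T,
) -> (T, Option<String>) {
    if !enabled {
        // Tracing off: run the op without touching the clock.
        return (f(), None);
    }
    let start = Instant::now();
    let out = f();
    let busy_ms = start.elapsed().as_millis();
    let warning = if busy_ms >= warn_ms {
        Some(format!("op={} busy_ms={}", op, busy_ms))
    } else {
        None
    };
    (out, warning)
}
```

A caller would wrap, say, the worker-selection step as `timed_op(enabled, "admit_select", 50, || select(...))` and emit the returned line at WARN level, where `select` is whatever synchronous call is being attributed.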
…DYN_STALL_OP_TRACE) To resolve codex (tokenize spawn_blocking pool-queue) vs per-log workflow (router admission-queue) for the receive->prefill-dispatch TTFT gap: time the encode INSIDE the spawn_blocking task so total offload latency splits into pool_wait_ms (submit->run, the bounded-blocking-pool queue) + encode_ms (run->done). WARN to dynamo_stall_op op=tokenize when total>=DYN_STALL_OP_WARN_MS. Paired with the existing op=schedule (router-queue/scheduler wall) and op=admit_select, one run now splits the 6.3s gap into tokenize-pool-wait vs router-queue cleanly. Zero-cost when off. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
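Splitting offload latency into pool wait and encode time needs one timestamp at submit and two inside the task. The sketch below models the bounded blocking pool as a single worker thread fed by a channel, which is an assumption made to keep it std-only; `OffloadTiming` and `run_single_worker_pool` are hypothetical names, and a sleep stands in for `tokenizer.encode`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Per-job split of offload latency.
pub struct OffloadTiming {
    pub pool_wait_ms: u128, // submit -> task starts running
    pub encode_ms: u128,    // task starts -> task done
}

/// Queues one job per entry in `work_ms`, then drains them on a single worker
/// thread. All jobs are queued before the worker starts, so each later job's
/// pool wait includes the run time of the jobs ahead of it.
pub fn run_single_worker_pool(work_ms: &[u64]) -> Vec<OffloadTiming> {
    let (job_tx, job_rx) = mpsc::channel::<(Instant, Duration)>();
    for ms in work_ms {
        // Stamp the submit time at the moment the job enters the queue.
        job_tx
            .send((Instant::now(), Duration::from_millis(*ms)))
            .expect("receiver is still alive");
    }
    drop(job_tx); // closing the channel lets the worker loop end

    let worker = thread::spawn(move || {
        let mut timings = Vec::new();
        for (submitted_at, work) in job_rx {
            let started_at = Instant::now(); // timed INSIDE the task
            thread::sleep(work); // stand-in for tokenizer.encode
            let done_at = Instant::now();
            timings.push(OffloadTiming {
                pool_wait_ms: (started_at - submitted_at).as_millis(),
                encode_ms: (done_at - started_at).as_millis(),
            });
        }
        timings
    });
    worker.join().expect("worker thread panicked")
}
```

If pool_wait_ms dominates, the blocking pool is the queue; if both are small while the gap persists, the delay sits elsewhere, such as router admission.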
…DYN_PREFILL_TRACE) Localize disagg prefill pre-decode time: per-request a(arrive)->b(CTX-selected)-> c(dispatch)->d(first-resp)->e(done) gaps + per-CTX in-flight-prefill gauge. Discriminator for CTX under-feed root cause. Zero-cost when off. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
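The a→e lifecycle trace reduces to five timestamps per request and the four gaps between them, plus a counter per CTX worker. The following is an illustrative sketch only: `PrefillTrace`, `InFlightGauge`, and the `u64` CTX id are assumptions, with the stage letters taken from the commit message.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Timestamps for one prefill request; stage letters follow the commit message.
pub struct PrefillTrace {
    pub a_arrive: Instant,
    pub b_ctx_selected: Instant,
    pub c_dispatch: Instant,
    pub d_first_resp: Instant,
    pub e_done: Instant,
}

impl PrefillTrace {
    /// Gaps in milliseconds for a->b, b->c, c->d, d->e.
    pub fn gaps_ms(&self) -> [u128; 4] {
        let stamps = [
            self.a_arrive,
            self.b_ctx_selected,
            self.c_dispatch,
            self.d_first_resp,
            self.e_done,
        ];
        let mut gaps = [0u128; 4];
        for i in 0..4 {
            gaps[i] = stamps[i + 1]
                .saturating_duration_since(stamps[i])
                .as_millis();
        }
        gaps
    }
}

/// Per-CTX count of prefills currently in flight.
#[derive(Default)]
pub struct InFlightGauge {
    per_ctx: HashMap<u64, usize>,
}

impl InFlightGauge {
    /// Call at dispatch; returns the new in-flight count for that CTX.
    pub fn start(&mut self, ctx_id: u64) -> usize {
        let n = self.per_ctx.entry(ctx_id).or_insert(0);
        *n += 1;
        *n
    }

    /// Call when the prefill completes; returns the remaining count.
    pub fn finish(&mut self, ctx_id: u64) -> usize {
        let n = self.per_ctx.entry(ctx_id).or_insert(0);
        *n = n.saturating_sub(1);
        *n
    }
}
```

A long a→b or b→c gap alongside a low in-flight gauge would suggest the CTX workers are under-fed rather than slow.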
Diagnostic + fix branch for the DeepSeek-V4-Pro disagg perf investigation, on top of feat/deepseek_v4_aa (base = 5142978). Opened for review of the changes; the instrument commits are env-gated (off by default).
Commits (7)
Fixes
bb6b5cac29 fix(frontend): offload prompt tokenization off the async event loop. Removes an earlier hard throughput collapse where tokenization blocked the frontend event loop under the c920 closed-loop load.
bab290059a fix(runtime): hysteresis before inhibiting a worker on transient request-plane faults.
Diagnostics (all env-gated, zero-cost when off)
096c9a6470d/097c9d85bc instrument: request-plane ACK decode→flush timing + per-runtime event-loop stall / request arrival-delay logging (DYN_ACK_TRACE).
ddb7ba24e8/0b4ca48ee3 instrument: per-op busy-time attribution for frontend event-loop stalls + split tokenize-offload latency into pool-wait vs encode (DYN_STALL_OP_TRACE).
70ae211745 instrument: gated prefill lifecycle trace (a→e) + per-CTX in-flight gauge (DYN_PREFILL_TRACE).
Scope
17 files, ~+826 lines (mostly the gated trace plumbing in lib/llm/src/kv_router/, lib/runtime/, components/src/dynamo/trtllm/). Diff is exactly these 7 commits vs feat/deepseek_v4_aa.
This is the image code behind the DSV4 disagg root-cause work (see internal ROOT_CAUSE notes). Not necessarily for merge as-is: the diagnostics may be trimmed before any upstream submission.