Add DeepSeek-V4-Pro 8k/1k SA recipes for GB300 (MTP-off + MTP-on, MXFP4) #192

Open
nv-yna wants to merge 1 commit into NVIDIA:sa-submission-q2-2026 from nv-yna:yna/dsv4-trt-8k1k-stp
Conversation

@nv-yna nv-yna commented Jun 2, 2026

Summary

26 srt-slurm recipes for the DeepSeek-V4-Pro 8k/1k disagg SA reproduction sweep on GB300. Engine configs are byte-translated from the executed-sweep ctx_config.yaml + gen_config.yaml — the Dynamo (dynamo.frontend + dynamo.trtllm) runs that produced the published 8k/1k pareto. One recipe per frontier operating point, across 5 decode topologies × 3 MTP modes:

  • MTP=Off (13): TEP8 c=4; TEP4 c=5/15/25/55; DEP32 c=154/308/615/1127; DEP16 c=1229/2253; DEP8 c=2253/4301
  • MTP=3 (11): TEP8 c=8; TEP4 c=10/15/30; DEP32 c=84/180/333/615; DEP16 c=666/1229; DEP8 c=1229
  • MTP=1 (2): DEP8 c=2253/4301
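
The counts in the list above can be cross-checked with a short script. This is only a bookkeeping sketch: the `SWEEP` dictionary and `count_points` helper are illustrative names, with the concurrency values transcribed from the summary rather than read from the recipe files.

```python
# Operating points transcribed from the PR summary:
# MTP mode -> decode topology -> list of concurrencies.
SWEEP = {
    "MTP=Off": {
        "TEP8": [4],
        "TEP4": [5, 15, 25, 55],
        "DEP32": [154, 308, 615, 1127],
        "DEP16": [1229, 2253],
        "DEP8": [2253, 4301],
    },
    "MTP=3": {
        "TEP8": [8],
        "TEP4": [10, 15, 30],
        "DEP32": [84, 180, 333, 615],
        "DEP16": [666, 1229],
        "DEP8": [1229],
    },
    "MTP=1": {"DEP8": [2253, 4301]},
}


def count_points(sweep):
    """Return the number of operating points per MTP mode."""
    return {mode: sum(len(c) for c in topo.values()) for mode, topo in sweep.items()}


counts = count_points(SWEEP)
total = sum(counts.values())
topologies = sorted({name for topo in SWEEP.values() for name in topo})

print(counts)      # per-mode recipe counts
print(total)       # total number of recipes
print(topologies)  # distinct decode topologies
```

The per-mode counts sum to the 26 recipes stated in the summary, spread over five decode topologies.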

Precision = MXFP4 (verified from the checkpoint tensor dtypes: MoE experts FP4 with E8M0 block-32 scales, dense UE8M0-FP8, kv-cache fp8) — hence the gb300_mxfp4/ dir (not nvfp4). Container pinned to the 1.3.0rc15.post1 image-index digest.
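
For readers unfamiliar with the format, the sketch below decodes one MXFP4 block as defined by the OCP Microscaling spec: 32 FP4 (E2M1) elements sharing a single E8M0 power-of-two scale. It is not taken from the checkpoint loader; the function names are illustrative, and the nibble packing order (low nibble first) is an assumption that may differ from the actual checkpoint layout.

```python
# Magnitudes representable by the 3 non-sign bits of an E2M1 (FP4) value.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # elements sharing one E8M0 scale


def decode_e8m0(byte):
    """E8M0 is an exponent-only scale: 2**(byte - 127); 255 encodes NaN."""
    if byte == 255:
        return float("nan")
    return 2.0 ** (byte - 127)


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def dequantize_block(packed, scale_byte):
    """Dequantize 16 packed bytes (32 FP4 values) with one shared scale.

    ASSUMPTION: the low nibble of each byte holds the earlier element.
    """
    if len(packed) != BLOCK // 2:
        raise ValueError("expected 16 packed bytes per block of 32 elements")
    scale = decode_e8m0(scale_byte)
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


block = dequantize_block(bytes([0x21] * 16), 128)  # scale byte 128 -> 2.0
print(block[:4])
```

Because the scale has no mantissa, dequantization is an exact power-of-two multiply, which is what distinguishes MXFP4 from NVFP4 (FP8 scales over blocks of 16).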

How these were produced (methodology): the author's trtllm-serve SA framework was run with only the server swapped to Dynamo (DISAGG_BACKEND=dynamo) — identical allocation, /raid staging, run_benchmark.sh multi-concurrency client, and get_disagg_e2e_metrics.py metric (incl. the TTFT<5000ms frontier selection). Result: Dynamo reproduces the trtllm-serve spreadsheet within ±1.6% (MTP-off) / ±3.5% (MTP-on) on both tput_per_user and tput_per_gpu.
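
The selection and comparison described above can be sketched as follows. This is not the logic of get_disagg_e2e_metrics.py, which is not shown in this PR: it assumes that "frontier" means the Pareto set over (tput_per_user, tput_per_gpu) after the TTFT filter, and the `Point` class and sample values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    name: str
    ttft_ms: float
    tput_per_user: float
    tput_per_gpu: float


def select_frontier(points, ttft_limit_ms=5000.0):
    """Keep points under the TTFT limit that no other eligible point dominates.

    ASSUMPTION: a point is dominated when another point is at least as good
    on both throughput metrics and strictly better on one of them.
    """
    eligible = [p for p in points if p.ttft_ms < ttft_limit_ms]

    def dominated(p):
        return any(
            q.tput_per_user >= p.tput_per_user
            and q.tput_per_gpu >= p.tput_per_gpu
            and (q.tput_per_user > p.tput_per_user or q.tput_per_gpu > p.tput_per_gpu)
            for q in eligible
        )

    return [p for p in eligible if not dominated(p)]


def relative_deviation_pct(measured, reference):
    """Signed deviation of a measured value from its reference, in percent."""
    return 100.0 * (measured - reference) / reference
```

A reproduction check would then compare each Dynamo frontier point against its trtllm-serve counterpart on both metrics, for example `abs(relative_deviation_pct(dynamo_value, serve_value)) <= 1.6` for the MTP-off points.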

Same four srt-slurm default overrides as #131 (dynamo.install:false, frontend.enable_multiple_frontends:false, benchmark.num_prompts_mult:1, benchmark.use_chat_template:false).
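
To make the four overrides concrete, the sketch below expands the dotted keys from the sentence above into a nested mapping. The keys and values come from the PR text; that the recipe YAML nests them exactly this way is an assumption, and `expand` is an illustrative helper, not part of srt-slurm.

```python
# The four default overrides, keyed exactly as written in the PR description.
OVERRIDES = {
    "dynamo.install": False,
    "frontend.enable_multiple_frontends": False,
    "benchmark.num_prompts_mult": 1,
    "benchmark.use_chat_template": False,
}


def expand(dotted):
    """Turn {'a.b': v} into {'a': {'b': v}}, merging shared prefixes."""
    nested = {}
    for key, value in dotted.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


print(expand(OVERRIDES))
```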

Worker env reflects the executed 8k/1k sweep (UCX_TLS, TRTLLM_ENABLE_PDL, *_DISABLE_GC, NCCL_GRAPH_MIXING_SUPPORT=0, MIMALLOC_PURGE_DELAY=0; ctx adds PYTORCH_CUDA_ALLOC_CONF=expandable_segments). This differs from #131's GB200 set — no UCX_CUDA_IPC_ENABLE_MNNVL/UCX_RNDV_SCHEME/TRTLLM_KVCACHE_HOST_SIZE_OVERRIDE, and (e2e mode, not gen-only) no per-c TLLM_BENCHMARK_REQ_QUEUES_SIZE.

Review notes: (1) model.path is the cmh checkpoint path (/lustre/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/models/dsv4-pro) — override per cluster. (2) precision dir is mxfp4 (accurate), unlike the existing DSV4-Pro gb200_nvfp4; happy to rename for consistency if the team prefers.

Test plan

  • Sweep executed end-to-end (26/26 frontier points) and pareto curve published (Dynamo vs trtllm-serve)
  • srtctl dry-run validates representative recipes — DEP32 c=154 (9-node), TEP4 c=5 (6-node), DEP8 c=4301_mtp1 (14-node), DEP16 c=1229_mtp3 (14-node); node counts + env resolve correctly
  • Rerun a subset from these committed recipes to confirm reproducibility

26 frontier operating points (13 MTP-off, 11 MTP3, 2 MTP1) across TEP4/TEP8/DEP8/DEP16/DEP32,
byte-translated from the executed Dynamo 8k/1k sweep. Precision = MXFP4 experts + UE8M0-FP8 dense.
Produced by running the trtllm-serve SA framework with only the server swapped to Dynamo; reproduces
the published spreadsheet within +/-1.6% (MTP-off) / +/-3.5% (MTP-on) on tput_per_user and tput_per_gpu.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (sa-submission-q2-2026@9ecc31f). Learn more about missing BASE report.

Additional details and impacted files
@@                   Coverage Diff                    @@
##             sa-submission-q2-2026     #192   +/-   ##
========================================================
  Coverage                         ?   61.51%           
========================================================
  Files                            ?       48           
  Lines                            ?     4176           
  Branches                         ?        0           
========================================================
  Hits                             ?     2569           
  Misses                           ?     1607           
  Partials                         ?        0           

@richardhuo-nv

Collaborator

Also let's use the folder structure like this:
https://github.com/NVIDIA/srt-slurm/tree/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP

Put the hardware, precision, ISL/OSL, and MTP/STP into the path, and place each config in the directory it belongs to.
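
One way to derive such a directory from recipe attributes is sketched below. The layout is inferred from the single example URL above; the `recipe_dir` helper, its parameters, and the `deepseek-v4-pro` model directory name in the usage lines are hypothetical, not names defined by srt-slurm.

```python
from pathlib import PurePosixPath


def recipe_dir(model, hardware, precision, isl_k, osl_k, mtp_enabled,
               root="recipes", backend="trtllm_dynamo", mode="disagg"):
    """Build <root>/<model>/<backend>/<mode>/<hw><Precision>/ISL<n>K_OSL<m>K/<MTP|STP>."""
    hw_precision = f"{hardware.lower()}{precision.capitalize()}"
    seq_lens = f"ISL{isl_k}K_OSL{osl_k}K"
    spec_mode = "MTP" if mtp_enabled else "STP"
    return PurePosixPath(root, model, backend, mode, hw_precision, seq_lens, spec_mode)


# Reproduces the path in the linked example.
print(recipe_dir("kimi2.5", "gb200", "nvfp4", 1, 1, True))

# Hypothetical placement for the MTP-off recipes in this PR.
print(recipe_dir("deepseek-v4-pro", "gb300", "mxfp4", 8, 1, False))
```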

# default overrides (dynamo.install, multi_frontend, num_prompts_mult, use_chat_template).
name: dsv4_pro_mxfp4_ISL8K_OSL1K_dep16_c1229_eplb384_mtp0
model:
  path: /lustre/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/models/dsv4-pro

Collaborator

You can just use the Hugging Face model ID for now. This needs to point to the model path on the SA cluster.

name: dsv4_pro_mxfp4_ISL8K_OSL1K_dep16_c1229_eplb384_mtp0
model:
  path: /lustre/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/models/dsv4-pro
  container: nvcr.io/nvstaging/ai-dynamo/tensorrtllm-runtime@sha256:6aa381ac47bf7f5d0ef0598a1cab97dc0005e01c41da104f420966373d9a09e4

Collaborator

Let's kick off a release and use the real released Dynamo image.

max_seq_len: 9256
moe_config:
  backend: MEGAMOE_DEEPGEMM
  load_balancer: /scratch/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/users/yna/dsv4-exp/framework_dynamo/deepseek-V4-Pro/offline_eplb_confs/moe_load_balancer_gen_ep16_slots384.yaml

Collaborator

You will need to make sure this file is part of the container mount.

Collaborator

I would suggest putting the file in the config folder in srt-slurm and mounting that config folder into the container.

osl: 1024
concurrencies: '1229'
req_rate: inf
num_prompts_mult: 1

@richardhuo-nv richardhuo-nv Jun 2, 2026

Collaborator

Are we sure the mult is just 1? Will the perf numbers even stabilize? In the past, we normally used 10 or 16.
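
The concern can be made concrete with a little arithmetic. The sketch assumes the total number of prompts is concurrency × num_prompts_mult, which is how the parameter name reads but is not confirmed by this PR; `total_prompts` is an illustrative helper.

```python
def total_prompts(concurrency, num_prompts_mult):
    """ASSUMPTION: the benchmark sends concurrency * num_prompts_mult prompts."""
    if concurrency <= 0 or num_prompts_mult <= 0:
        raise ValueError("concurrency and multiplier must be positive")
    return concurrency * num_prompts_mult


for mult in (1, 10, 16):
    print(f"c=1229 mult={mult}: {total_prompts(1229, mult)} prompts")
```

Under that assumption, a multiplier of 1 gives each concurrent slot a single request, so the run consists mostly of ramp-up and drain, whereas 10 or 16 leaves a longer steady-state window to measure.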

concurrencies: '1229'
req_rate: inf
num_prompts_mult: 1
use_chat_template: false

Collaborator

For MTP, using the chat template is a must-have AFAIK; please double-check.