Add DeepSeek-V4-Pro 8k/1k SA recipes for GB300 (MTP-off + MTP-on, MXFP4)#192
nv-yna wants to merge 1 commit into
26 frontier operating points (13 MTP-off, 11 MTP3, 2 MTP1) across TEP4/TEP8/DEP8/DEP16/DEP32, byte-translated from the executed Dynamo 8k/1k sweep. Precision = MXFP4 experts + UE8M0-FP8 dense. Produced by running the trtllm-serve SA framework with only the server swapped to Dynamo; reproduces the published spreadsheet within +/-1.6% (MTP-off) / +/-3.5% (MTP-on) on tput_per_user and tput_per_gpu. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## sa-submission-q2-2026 #192 +/- ##
Coverage ? 61.51%
Files ? 48
Lines ? 4176
Branches ? 0
Hits ? 2569
Misses ? 1607
Partials ? 0
Also, let's use a folder structure like this: encode hardware/precision/ISL-OSL/(MTP or STP) in the path, and put each config under the path it belongs to.
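One way to picture the layout suggested above is a small helper that derives the directory from a recipe name. This is only a sketch: the `recipe_dir` helper, the regex, the `gb300` default, the mapping of `mtp0` to `stp`, and the exact directory spelling are assumptions made for illustration, not an established srt-slurm convention.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern, modelled on names such as
# dsv4_pro_mxfp4_ISL8K_OSL1K_dep16_c1229_eplb384_mtp0
NAME_RE = re.compile(
    r"^(?P<model>.+)_(?P<precision>[a-z0-9]+)"
    r"_ISL(?P<isl>\d+K)_OSL(?P<osl>\d+K)"
    r"_(?P<rest>.+)_mtp(?P<mtp>\d+)$"
)


def recipe_dir(name: str, hardware: str = "gb300") -> PurePosixPath:
    """Build hardware/precision/isl-osl/mode from a recipe name (assumed layout)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised recipe name: {name}")
    # Assumption: mtp0 means no speculative tokens, i.e. the STP bucket.
    mode = "stp" if match["mtp"] == "0" else f"mtp{match['mtp']}"
    seq = f"isl{match['isl'].lower()}_osl{match['osl'].lower()}"
    return PurePosixPath(hardware) / match["precision"] / seq / mode


print(recipe_dir("dsv4_pro_mxfp4_ISL8K_OSL1K_dep16_c1229_eplb384_mtp0"))
```

Deriving the directory from the name keeps the two from drifting apart, at the cost of making the naming scheme load-bearing.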
# default overrides (dynamo.install, multi_frontend, num_prompts_mult, use_chat_template).
name: dsv4_pro_mxfp4_ISL8K_OSL1K_dep16_c1229_eplb384_mtp0
model:
  path: /lustre/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/models/dsv4-pro
You can just put the Hugging Face model ID for now. This needs to point to the model path on the SA cluster.
name: dsv4_pro_mxfp4_ISL8K_OSL1K_dep16_c1229_eplb384_mtp0
model:
  path: /lustre/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/models/dsv4-pro
  container: nvcr.io/nvstaging/ai-dynamo/tensorrtllm-runtime@sha256:6aa381ac47bf7f5d0ef0598a1cab97dc0005e01c41da104f420966373d9a09e4
Let's kick off a release and use the actual released Dynamo image.
max_seq_len: 9256
moe_config:
  backend: MEGAMOE_DEEPGEMM
  load_balancer: /scratch/fsw/portfolios/coreai/projects/coreai_comparch_inferencex/users/yna/dsv4-exp/framework_dynamo/deepseek-V4-Pro/offline_eplb_confs/moe_load_balancer_gen_ep16_slots384.yaml
You will need to make sure this file is part of the container mount.
I would suggest putting the file in the config folder in srt-slurm and mounting that config folder into the container.
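A quick pre-flight check along these lines could catch a `load_balancer` path that no container mount covers. This is a sketch under assumptions: `covered_by_mount` is a hypothetical helper, and the `/configs` mount target and the paths in the example are placeholders, not the real recipe values.

```python
from pathlib import PurePosixPath


def covered_by_mount(file_path: str, mount_targets: list[str]) -> bool:
    """Return True if file_path lies at or below one of the in-container mount targets."""
    path = PurePosixPath(file_path)
    ancestors = (path, *path.parents)
    return any(PurePosixPath(target) in ancestors for target in mount_targets)


# Placeholder values for illustration only.
mounts = ["/configs"]
print(covered_by_mount("/configs/offline_eplb_confs/moe_load_balancer_gen_ep16_slots384.yaml", mounts))
print(covered_by_mount("/scratch/some/user/dir/moe_load_balancer_gen_ep16_slots384.yaml", mounts))
```

Comparing whole path components rather than string prefixes avoids treating `/configs_old/...` as if it were under `/configs`.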
osl: 1024
concurrencies: '1229'
req_rate: inf
num_prompts_mult: 1
Are we sure the mult is just 1? Will the perf numbers even stabilize? In the past, we normally used 10 or 16.
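To make the concern concrete, here is a rough sketch of how many requests each setting would send at `c=1229`. It assumes the benchmark client issues `concurrency × num_prompts_mult` prompts; that reading of the multiplier is an assumption and is not confirmed in this thread. `total_prompts` is a hypothetical helper.

```python
def total_prompts(concurrency: int, num_prompts_mult: int) -> int:
    """Assumed relation: the client sends concurrency * num_prompts_mult prompts."""
    return concurrency * num_prompts_mult


concurrency = 1229  # from the recipe under review
for mult in (1, 10, 16):
    print(f"mult={mult:>2}: {total_prompts(concurrency, mult)} prompts")
```

Under that assumption, a multiplier of 1 means each concurrency slot serves a single request, so the run consists mostly of ramp-up and drain, while 10 or 16 gives each slot several back-to-back requests.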
concurrencies: '1229'
req_rate: inf
num_prompts_mult: 1
use_chat_template: false
For MTP, using the chat template is a must-have AFAIK; please double-check.
Summary
26 srt-slurm recipes for the DeepSeek-V4-Pro 8k/1k disagg SA reproduction sweep on GB300. Engine configs are byte-translated from the executed-sweep ctx_config.yaml + gen_config.yaml, i.e. the Dynamo (dynamo.frontend + dynamo.trtllm) runs that produced the published 8k/1k pareto. One recipe per frontier operating point, across 5 decode topologies × 3 MTP modes:

MTP-off (13): c=4; TEP4 c=5/15/25/55; DEP32 c=154/308/615/1127; DEP16 c=1229/2253; DEP8 c=2253/4301
MTP3 (11): c=8; TEP4 c=10/15/30; DEP32 c=84/180/333/615; DEP16 c=666/1229; DEP8 c=1229
MTP1 (2): c=2253/4301

Precision = MXFP4 (verified from the checkpoint tensor dtypes: MoE experts FP4 with E8M0 block-32 scales, dense UE8M0-FP8, kv-cache fp8), hence the gb300_mxfp4/ dir (not nvfp4). Container pinned to the 1.3.0rc15.post1 image-index digest.

How these were produced (methodology): the author's trtllm-serve SA framework was run with only the server swapped to Dynamo (DISAGG_BACKEND=dynamo), keeping identical allocation, /raid staging, run_benchmark.sh multi-concurrency client, and get_disagg_e2e_metrics.py metric (incl. the TTFT<5000ms frontier selection). Result: Dynamo reproduces the trtllm-serve spreadsheet within ±1.6% (MTP-off) / ±3.5% (MTP-on) on both tput_per_user and tput_per_gpu.

Same four srt-slurm default overrides as #131 (dynamo.install:false, frontend.enable_multiple_frontends:false, benchmark.num_prompts_mult:1, benchmark.use_chat_template:false).

Worker env reflects the executed 8k/1k sweep (UCX_TLS, TRTLLM_ENABLE_PDL, *_DISABLE_GC, NCCL_GRAPH_MIXING_SUPPORT=0, MIMALLOC_PURGE_DELAY=0; ctx adds PYTORCH_CUDA_ALLOC_CONF=expandable_segments). This differs from #131's GB200 set: no UCX_CUDA_IPC_ENABLE_MNNVL / UCX_RNDV_SCHEME / TRTLLM_KVCACHE_HOST_SIZE_OVERRIDE, and (e2e mode, not gen-only) no per-c TLLM_BENCHMARK_REQ_QUEUES_SIZE.

Test plan
srtctl dry-run validates representative recipes: DEP32 c=154 (9-node), TEP4 c=5 (6-node), DEP8 c=4301_mtp1 (14-node), DEP16 c=1229_mtp3 (14-node); node counts + env resolve correctly
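
The ±1.6% (MTP-off) / ±3.5% (MTP-on) agreement quoted in the summary can be checked with a relative-deviation comparison such as the sketch below. The helper names and the sample numbers are hypothetical placeholders, not measured results; only the two tolerance values come from this PR.

```python
def rel_dev_pct(reproduced: float, reference: float) -> float:
    """Signed relative deviation of reproduced vs. reference, in percent."""
    return 100.0 * (reproduced - reference) / reference


def within_tolerance(pairs, tol_pct: float) -> bool:
    """True if every (reproduced, reference) pair deviates by at most tol_pct percent."""
    return all(abs(rel_dev_pct(rep, ref)) <= tol_pct for rep, ref in pairs)


# Tolerances stated in the PR description.
TOL_PCT = {"mtp_off": 1.6, "mtp_on": 3.5}

# Placeholder (reproduced, reference) values for illustration only; NOT real data.
example_pairs = [(101.2, 100.0), (49.4, 50.0)]
print(within_tolerance(example_pairs, TOL_PCT["mtp_off"]))
```

The same check would be applied separately to tput_per_user and tput_per_gpu for each operating point.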