[AMD] Tune MiniMax-M3 MXFP8 MI300X vLLM: async scheduling + big-prefill, fix conc256 EP8→EP1#1951
ZhengGong-amd wants to merge 4 commits into
Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs:
- TORCH_BLAS_PREFER_HIPBLASLT=1
- NCCL_MIN_NCHANNELS=112 (RCCL channels, raised above the ~32-64 default for TP8)
- GPU_MAX_HW_QUEUES=2 (HIP streams, capped below the default of 4)

All changes are kernel-selection/runtime only; GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Cursor <cursoragent@cursor.com>
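The environment this commit describes can be sketched in a few lines. This is a minimal illustration, assuming the variables are applied to the process environment before the server starts. The helper name `aiter_mi300x_env` is hypothetical and not part of the recipe; only the variable names and values come from the commit message.

```python
import os


def aiter_mi300x_env(base=None):
    """Return a copy of an environment with the MI300X AITER settings applied.

    Hypothetical helper for illustration; the values mirror the commit message.
    """
    env = dict(os.environ if base is None else base)
    env.update({
        "VLLM_ROCM_USE_AITER": "1",          # master toggle; per-component flags keep their defaults
        "VLLM_ROCM_USE_AITER_MHA": "0",      # keep attention on TRITON_ATTN
        "TORCH_BLAS_PREFER_HIPBLASLT": "1",
        "NCCL_MIN_NCHANNELS": "112",         # RCCL channels for TP8
        "GPU_MAX_HW_QUEUES": "2",            # HIP streams
    })
    return env


# Start from an empty base so only the recipe's own settings are printed.
env = aiter_mi300x_env(base={})
for name in sorted(env):
    print(f"export {name}={env[name]}")
```

Passing an explicit `base` keeps the sketch side-effect free; a launcher could instead hand the returned mapping to `subprocess.run(..., env=env)`.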
…P8->EP1

Stack the accuracy-safe scheduling levers found across the arbor tuning sessions on top of the AITER MI300X recipe:
- --async-scheduling (overlap CPU input-prep with GPU decode)
- --max-num-batched-tokens 16384 (amortize the per-step ~95 GB/rank BF16-emulated MoE weight read; halves prefill weight-reads vs the 8192 default)
- amd-master.yaml: switch the 1k1k conc256 row from TP8/EP8 to TP8/EP1; the EP8 topology regressed high-concurrency throughput (434 vs 905 tok/s/gpu @ conc256) and EP1 matches the topology the AITER uplift was measured against.

Both serve flags are token-for-token identical (scheduling only).

Measured on 8xMI300X 1k1k vs the AITER baseline (total tok/s/gpu): conc256 434->905 (EP8->EP1 + levers, +108%), conc64 364->429 (+18%), conc128 585->628 (+7.3%); conc1-32 neutral. GSM8K exact-match 0.959.

Co-authored-by: Cursor <cursoragent@cursor.com>
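The percentages quoted in this commit message follow directly from the before/after throughput pairs. The sketch below recomputes them; the dictionary and function names are illustrative, and the numbers are copied from the message. The results agree with the quoted figures up to rounding in the last digit.

```python
# (before, after) throughput in total tok/s/gpu, copied from the commit message.
measurements = {
    "conc64": (364, 429),
    "conc128": (585, 628),
    "conc256": (434, 905),
}


def uplift_percent(before, after):
    """Relative throughput change in percent."""
    return (after - before) / before * 100.0


uplifts = {name: uplift_percent(b, a) for name, (b, a) in measurements.items()}
for name, pct in uplifts.items():
    print(f"{name}: {pct:+.2f}%")
```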
Condense the recipe header note and the amd-master.yaml search-space comment introduced in the previous commit; rationale/measurements live in the perf-changelog entry. Co-authored-by: Cursor <cursoragent@cursor.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Replace the PENDING placeholder on both new minimaxm3-fp8-mi300x-vllm entries with the canonical PR URL; PENDING is not in the accepted PR_LINK_PLACEHOLDERS set and fails validate_perf_changelog.py and the merge canonicalize step. Co-authored-by: Cursor <cursoragent@cursor.com>
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1942

- config-keys:
    - minimaxm3-fp8-mi300x-vllm
Duplicated blocks: line 4306 and line 4313
osl: 1024
search-space:
- { tp: 8, conc-start: 1, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 256, conc-end: 256 }
@ZhengGong-amd have you tested your method with the nightly image: https://hub.docker.com/r/vllm/vllm-openai-rocm/tags? For example, vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28353240593
Summary
Stacks two accuracy-safe scheduling levers on the minimaxm3-fp8-mi300x-vllm recipe and corrects a high-concurrency parallelism regression in its search space:

- benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh: add two server flags:
  - --async-scheduling: overlaps CPU input-prep with GPU decode (token-for-token identical).
  - --max-num-batched-tokens 16384: amortizes the per-step ~95 GB/rank BF16-emulated MoE weight read (gfx942 has no native MXFP8 MoE GEMM, so MXFP8 experts are dequantized to BF16 and re-read every prefill step) over a larger prefill batch, halving prefill weight-reads vs the vLLM 8192 default.
- .github/configs/amd-master.yaml: switch the 1k1k conc256 search-space row from TP8/EP8 to TP8/EP1. The EP8 topology regressed high-concurrency throughput (>2× slower at conc256 on this shape), so EP1 is used across the full 1k1k concurrency range.
- perf-changelog.yaml: append the required trigger entry for minimaxm3-fp8-mi300x-vllm.

Both serve flags are scheduling-only (no numerics, no quantization, no reduction reassociation) and accuracy-safe. No image change; serving shape (block-size 128, BF16 KV, TRITON_ATTN, MiniMax-M3 parsers, TP8) is otherwise unchanged. The MTP variant and the 8k1k EP8 rows are left untouched.

Why
On the 1024/1024 closed-loop sweep the recipe was prefill-bound at high concurrency: synchronized --ignore-eos prefill waves stole wall-clock time from decode, and the default 8192-token budget needed multiple full MoE weight-reads to clear each wave. Raising the prefill budget and enabling async scheduling lifts the decode duty cycle. Separately, the conc256 EP8 row was a regression: EP8 underperformed EP1 by >2× at conc256 on this shape.

Validation
Measured on 8× MI300X (gfx942), TP8, random 1024-in/1024-out, --request-rate inf --ignore-eos, vs the InferenceX MI300X stock vLLM FP8 reference curve (total tok/s/gpu):

Low/mid concurrency (1–32) is latency-bound and tracks the reference within noise; the gains land at high concurrency (conc 64/128/256).

- bash -n on the recipe: OK.
- generate_sweep_configs.py full-sweep --model-prefix minimaxm3 --framework vllm --runner-type mi300x: generates the 1k1k EP1 sweep (conc256 now EP1); no schema errors.
- python3 utils/validate_perf_changelog.py: final newline present, matrix generated.
- python -m pytest utils/matrix_logic/ -q: passing.

PR Review Checklist
As a PR author, I have:

- minimaxm3-fp8-mi300x-vllm) already exists; this PR only tunes it.
- --async-scheduling and --max-num-batched-tokens are upstream vLLM V1 scheduler flags; no custom code.

Additional Details

- conc256 gain is part topology fix, part scheduling levers: the prior 1k1k conc256 EP8 row was >2× slower than EP1 on this shape, so switching it to EP1 is what lets the high-concurrency point reach 905 tok/s/gpu; conc64/conc128 gains are pure scheduling-lever effects at fixed EP1.
- conc128/256 EP8 rows are intentionally left unchanged pending 8k1k data.
- perf-changelog.yaml entries use the canonical pr-link: .../pull/1951.
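The prefill-budget argument from the Why section can be checked with back-of-the-envelope arithmetic. The sketch below is a deliberately simplified model and not a description of the vLLM scheduler: it assumes every request in a wave arrives at once with exactly 1024 input tokens, that each scheduler step fills the whole token budget with prefill tokens, and that each step costs one full ~95 GB/rank weight read. The function and constant names are illustrative.

```python
import math

ISL = 1024            # input tokens per request (1k1k shape)
WEIGHT_READ_GB = 95   # approximate per-step MoE weight read per rank, from the PR text


def prefill_steps(concurrency, token_budget, isl=ISL):
    """Scheduler steps needed to prefill one synchronized wave (simplified model)."""
    return math.ceil(concurrency * isl / token_budget)


for conc in (64, 128, 256):
    old = prefill_steps(conc, 8192)    # vLLM default budget cited in the PR
    new = prefill_steps(conc, 16384)   # budget set by this PR
    print(f"conc{conc}: {old} -> {new} steps, "
          f"~{old * WEIGHT_READ_GB} -> ~{new * WEIGHT_READ_GB} GB read per rank")
```

Under these assumptions the step count, and with it the weight traffic per wave, halves at every listed concurrency, which is consistent with the "halving prefill weight-reads" claim. Real runs mix decode tokens into the same budget, so the measured gain is smaller than this idealized ratio.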