[AMD] Tune MiniMax-M3 MXFP8 MI300X vLLM: async scheduling + big-prefill, fix conc256 EP8→EP1 by ZhengGong-amd · Pull Request #1950 · SemiAnalysisAI/InferenceX

ZhengGong-amd · 2026-06-29T03:07:42Z

Summary

Stacks two accuracy-safe scheduling levers on the minimaxm3-fp8-mi300x-vllm recipe and corrects a high-concurrency parallelism regression in its search space:

benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh — add two server flags:
- --async-scheduling — overlaps CPU input-prep with GPU decode (token-for-token identical).
- --max-num-batched-tokens 16384 — amortizes the per-step ~95 GB/rank BF16-emulated MoE weight read (gfx942 has no native MXFP8 MoE GEMM, so MXFP8 experts are dequantized to BF16 and re-read every prefill step) over a larger prefill batch, halving prefill weight-reads vs the vLLM 8192 default.
.github/configs/amd-master.yaml — switch the 1k1k conc256 search-space row from TP8/EP8 to TP8/EP1. The EP8 topology regressed high-concurrency throughput, and EP1 is the topology the prior AITER uplift was measured against.
perf-changelog.yaml — append the required trigger entry for minimaxm3-fp8-mi300x-vllm.

Both serve flags are scheduling-only (no numerics, no quantization, no reduction reassociation) and accuracy-safe. No image change; serving shape (block-size 128, BF16 KV, TRITON_ATTN, MiniMax-M3 parsers, TP8) is otherwise unchanged. The MTP variant and the 8k1k EP8 rows are left untouched.
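As a rough illustration of the serving shape described above, the sketch below assembles a `vllm serve` argument list with the two added flags. It is not the recipe itself: the model identifier is a hypothetical placeholder (the PR does not quote the checkpoint path), and the real script sets further options (parsers, attention backend, KV dtype) that are omitted here.

```python
import shlex

# Hypothetical placeholder: the PR text does not give the checkpoint path.
MODEL = "<minimax-m3-mxfp8-checkpoint>"


def build_serve_command(model: str = MODEL) -> list[str]:
    """Return an argv list for `vllm serve` with the two scheduling flags from this PR."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "8",        # TP8, unchanged by this PR
        "--block-size", "128",                # unchanged by this PR
        "--async-scheduling",                 # added: overlap CPU input-prep with GPU decode
        "--max-num-batched-tokens", "16384",  # added: larger prefill budget (default 8192)
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is launched.
    print(shlex.join(build_serve_command()))
```

Building the command as a list keeps each flag and its value as separate arguments, so no shell quoting is needed if it is later passed to `subprocess.run`.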

Why

On the 1024/1024 closed-loop sweep the recipe was prefill-bound at high concurrency: synchronized --ignore-eos prefill waves stole wall-clock time from decode, and the default 8192-token budget needed multiple full MoE weight-reads to clear each wave. Raising the prefill budget and enabling async scheduling lifts the decode duty cycle. Separately, the conc256 EP8 row was a regression: on this shape EP8 underperformed EP1 by more than 2× at conc256 (434 vs 905 tok/s/gpu, the latter measured with both levers applied).
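The weight-read argument can be made concrete with a simplified back-of-the-envelope model. The sketch assumes that a synchronized wave prefills every request's full 1024-token prompt, that each scheduler step processes at most the token budget, and that each step costs one full ~95 GB/rank weight read. Decode tokens sharing the budget are ignored; these are modelling assumptions, not measurements from the PR.

```python
import math

WEIGHT_READ_GB_PER_STEP = 95  # approximate per-rank figure quoted in the PR


def prefill_steps(concurrency: int, prompt_len: int, token_budget: int) -> int:
    """Steps needed to prefill one synchronized wave under the given token budget."""
    return math.ceil(concurrency * prompt_len / token_budget)


def wave_weight_read_gb(concurrency: int, prompt_len: int, token_budget: int) -> int:
    """Approximate GB of expert weights read per rank to clear one prefill wave."""
    return prefill_steps(concurrency, prompt_len, token_budget) * WEIGHT_READ_GB_PER_STEP


if __name__ == "__main__":
    for budget in (8192, 16384):
        steps = prefill_steps(256, 1024, budget)
        gb = wave_weight_read_gb(256, 1024, budget)
        print(f"budget={budget}: {steps} steps, ~{gb} GB/rank per wave")
```

Under these assumptions a conc256 wave takes 32 steps at the 8192 budget and 16 steps at 16384, which is where the "halving" of prefill weight-reads comes from.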

Validation

Measured on 8× MI300X (gfx942), TP8, random 1024-in/1024-out, --request-rate inf --ignore-eos, vs the AITER baseline (minimaxm3-mi300x-aiter-tuning) measured back-to-back on the same host (total tok/s/gpu):

| conc | baseline (tok/s/gpu) | this PR (tok/s/gpu) | Δ |
|---|---|---|---|
| 64 | 364 | 429 | +18.0% |
| 128 | 585 | 628 | +7.3% |
| 256 | 434 (EP8) | 905 (EP1) | +108% (topology fix + levers) |
| 1–32 | - | - | neutral (latency-bound) |
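For readers who want to sanity-check the Δ column, the sketch below recomputes the relative change from the throughput figures in the table. Because the inputs are the rounded values shown here, the recomputed percentages may differ from the PR's figures in the last digit.

```python
# Rounded total tok/s/gpu values taken from the table: conc -> (baseline, this PR)
ROWS = {
    64: (364, 429),
    128: (585, 628),
    256: (434, 905),  # baseline row was EP8, this PR's row is EP1
}


def pct_delta(baseline: float, tuned: float) -> float:
    """Relative change of `tuned` over `baseline`, in percent."""
    return (tuned / baseline - 1.0) * 100.0


if __name__ == "__main__":
    for conc, (baseline, tuned) in ROWS.items():
        print(f"conc{conc}: {baseline} -> {tuned} ({pct_delta(baseline, tuned):+.1f}%)")
```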

conc256 905 tok/s/gpu also beats the InferenceX MI300X stock reference (782, +16%).
GSM8K (winner config, lm-eval 5-shot): exact_match 0.9591 (strict) / 0.9583 (flexible) — ≥ 0.85 gate, unchanged from baseline (scheduling-only).
bash -n on the recipe — OK.
generate_sweep_configs.py full-sweep --model-prefix minimaxm3 --framework vllm --runner-type mi300x — generates the 1k1k EP1 sweep (conc256 now EP1); no schema errors.
python3 utils/validate_perf_changelog.py — final newline present, matrix generated.
python -m pytest utils/matrix_logic/ -q — passing.

PR Review Checklist

As a PR author, I have:

Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
Verified that this PR has passed PR validation.
Verified that this PR passes evals (GSM8K exact_match 0.9591, ≥ 0.85).
Verified that speculative decoding PRs use chat templates to align the AL distribution to real world — N/A (non-MTP, no speculative decoding in this PR).
vLLM is a first-class engine for this hardware and the vLLM submission (minimaxm3-fp8-mi300x-vllm) already exists; this PR only tunes it.
Verified that the single-node recipe is similar to the official vLLM recipes: --async-scheduling and --max-num-batched-tokens are upstream vLLM V1 scheduler flags; no custom code.
If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional Details

This branch builds on minimaxm3-mi300x-aiter-tuning (the AITER-kernels enablement). If that PR has not merged, this PR's diff includes its commits; rebase/retarget once it lands.
The conc256 +108% figure combines the EP8→EP1 topology fix with the two scheduling levers; the conc64/conc128 gains (+18%/+7.3%) are pure scheduling-lever effects at fixed EP1.
8k1k throughput for this key was not separately re-measured; the two serve flags still apply to it (accuracy-safe), but its conc128/256 EP8 rows are intentionally left unchanged pending 8k1k data.
Replace the pr-link: .../pull/PENDING placeholders in perf-changelog.yaml with this PR's URL after opening.

Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs:
- TORCH_BLAS_PREFER_HIPBLASLT=1
- NCCL_MIN_NCHANNELS=112 (RCCL channels, raised above the ~32-64 default for TP8)
- GPU_MAX_HW_QUEUES=2 (HIP streams, capped below the default of 4)

All changes are kernel-selection/runtime only; GSM8K holds ~0.95. Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Cursor <cursoragent@cursor.com>
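The environment described in the AITER commit message above could be expressed as follows. This is a minimal sketch: the variable names and values are those quoted in the commit message, while the helper `server_env` is a hypothetical name introduced here for illustration and is not part of the recipe.

```python
import os
from typing import Mapping, Optional

# Values as quoted in the commit message; per-component AITER flags stay at their defaults.
AITER_ENV = {
    "VLLM_ROCM_USE_AITER": "1",          # master toggle
    "VLLM_ROCM_USE_AITER_MHA": "0",      # keep attention on TRITON_ATTN
    "TORCH_BLAS_PREFER_HIPBLASLT": "1",
    "NCCL_MIN_NCHANNELS": "112",         # RCCL channels
    "GPU_MAX_HW_QUEUES": "2",            # HIP streams
}


def server_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of `base` (default: the current environment) with the AITER settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(AITER_ENV)
    return env


if __name__ == "__main__":
    for name in AITER_ENV:
        print(f"{name}={server_env()[name]}")
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the server, which leaves the parent process environment untouched.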

Sync the branch with the latest upstream main (fork main force-synced to upstream). Resolve the perf-changelog.yaml conflict by taking main's version and re-appending the branch's own minimaxm3-fp8-mi300x-vllm AITER entry at the tail. The AITER target benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh auto-merged cleanly (main's SemiAnalysisAI#1837 image/FP8-KV change was reverted by SemiAnalysisAI#1857, so main's net change to that file is zero); the AITER env exports are preserved.

Co-authored-by: Cursor <cursoragent@cursor.com>

…P8->EP1

Stack the accuracy-safe scheduling levers found across the arbor tuning sessions on top of the AITER MI300X recipe:
- --async-scheduling (overlap CPU input-prep with GPU decode)
- --max-num-batched-tokens 16384 (amortize the per-step ~95 GB/rank BF16-emulated MoE weight read; halves prefill weight-reads vs the 8192 default)
- amd-master.yaml: switch the 1k1k conc256 row from TP8/EP8 to TP8/EP1; the EP8 topology regressed high-concurrency throughput (434 vs 905 tok/s/gpu @ conc256) and EP1 matches the topology the AITER uplift was measured against.

Both serve flags are token-for-token identical (scheduling only). Measured on 8xMI300X 1k1k vs the AITER baseline (total tok/s/gpu): conc256 434->905 (EP8->EP1 + levers, +108%), conc64 364->429 (+18%), conc128 585->628 (+7.3%); conc1-32 neutral. GSM8K exact-match 0.959.

Co-authored-by: Cursor <cursoragent@cursor.com>

Condense the recipe header note and the amd-master.yaml search-space comment introduced in the previous commit; rationale/measurements live in the perf-changelog entry.

Co-authored-by: Cursor <cursoragent@cursor.com>

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

ZhengGong-amd and others added 4 commits June 16, 2026 06:44

ZhengGong-amd requested a review from a team June 29, 2026 03:07

ZhengGong-amd requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 29, 2026 03:07

github-project-automation Bot added this to InferenceMAX Board Jun 29, 2026

claude Bot reviewed Jun 29, 2026

Merge remote-tracking branch 'upstream/main' into minimaxm3-mi300x-combo-tuning (d9c4176)

# Conflicts:
# perf-changelog.yaml

ZhengGong-amd closed this Jun 29, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 29, 2026

ZhengGong-amd wants to merge 5 commits into SemiAnalysisAI:main from ZhengGong-amd:minimaxm3-mi300x-combo-tuning
