Amd/vllm disagg minimax fp8 cdna3 by haic0 · Pull Request #1949 · SemiAnalysisAI/InferenceX

haic0 · 2026-06-28T14:13:34Z

No description provided.

The inline collect_latest_results.py hardcoded "sglang" as the log directory prefix, causing "No logs directory found" for vllm-disagg runs where bench.sh creates directories named vllm-disagg_isl_X_osl_Y. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
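As an illustration of the fix described above, here is a minimal Python sketch that looks up log directories by framework prefix instead of a hardcoded "sglang". The function name `find_log_dirs` and the `logs_root` parameter are hypothetical. The `<framework>_isl_X_osl_Y` naming is taken from the commit message for vllm-disagg and is only assumed to hold for sglang; this is not the actual contents of collect_latest_results.py.

```python
from pathlib import Path
from typing import List


def find_log_dirs(logs_root: Path, framework: str) -> List[Path]:
    """Return benchmark log directories for one framework, sorted by name.

    Assumes directories are named <framework>_isl_<X>_osl_<Y>.
    """
    # Build the glob from the framework name rather than hardcoding "sglang".
    pattern = f"{framework}_isl_*_osl_*"
    dirs = sorted(p for p in logs_root.glob(pattern) if p.is_dir())
    if not dirs:
        raise FileNotFoundError(
            f"No logs directory found for {framework!r} in {logs_root}"
        )
    return dirs
```

Passing the framework name as a parameter means a vllm-disagg run and an sglang run can share the same collection step.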

The vllm-router runs as a separate container on node 0. After node 0's main container finishes the benchmark and exits, decode nodes remain stuck waiting for the router port to close. The router cleanup in job.slurm can't run until srun completes, but srun can't complete because decode nodes are blocked, so the job deadlocks.

Fix: skip exec on rank 0 for vllm-disagg so the srun bash script continues after docker exits and can stop the router container, allowing decode nodes to detect the port closure and exit.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
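The decode-side wait described above can be sketched as a simple poll on the router port. This is a hedged Python illustration, not the PR's shell code: `port_is_open` and `wait_for_port_closed` are hypothetical names, and the optional `max_wait_s` bound is an addition of this sketch rather than something the commit message mentions.

```python
import socket
import time
from typing import Optional


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port_closed(
    host: str,
    port: int,
    poll_s: float = 1.0,
    max_wait_s: Optional[float] = None,
) -> bool:
    """Block until host:port stops accepting connections.

    Returns True once the port is closed, or False if max_wait_s elapses
    first. With max_wait_s=None the loop waits indefinitely, which is the
    behavior that deadlocks if nothing ever stops the router.
    """
    deadline = None if max_wait_s is None else time.monotonic() + max_wait_s
    while port_is_open(host, port):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

An unbounded wait like this only terminates if some other process eventually stops the router, which is why the ordering of the rank 0 script matters.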

The EXIT trap deleted benchmark_logs/ before saving artifacts, making it impossible to debug container startup failures. Now the trap always copies slurm .out/.err to the artifact directory and prints the last 100 lines of .err inline in the CI output. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
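The ordering this commit describes (copy logs, print the tail of .err, then delete) can be sketched in Python as below. The real change is in a shell EXIT trap; `tail_lines` and `save_then_cleanup` are hypothetical names, and the sketch assumes the slurm .out/.err files sit inside the directory being cleaned up, which the commit message does not state.

```python
import shutil
from pathlib import Path
from typing import List


def tail_lines(path: Path, n: int = 100) -> List[str]:
    """Return the last n lines of a text file (n must be positive)."""
    with path.open(errors="replace") as f:
        return f.read().splitlines()[-n:]


def save_then_cleanup(log_dir: Path, artifact_dir: Path, tail: int = 100) -> None:
    """Copy slurm .out/.err files to artifact_dir, echo the .err tail, then delete log_dir."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    logs = sorted(log_dir.glob("*.out")) + sorted(log_dir.glob("*.err"))
    for log in logs:
        shutil.copy2(log, artifact_dir / log.name)
        if log.suffix == ".err":
            # Print the tail inline so it shows up directly in CI output.
            print(f"--- last {tail} lines of {log.name} ---")
            print("\n".join(tail_lines(log, tail)))
    # Delete only after the copies above have been made.
    shutil.rmtree(log_dir, ignore_errors=True)
```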

The batch host has docker socket permissions but the compute nodes do not, causing "permission denied" on all srun tasks. Move the detection after SELECTED_NODES is known and probe via srun. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Export DOCKER_CMD_DETECT as a shell snippet that each srun participant evaluates locally, instead of testing a single node and assuming all nodes have the same docker socket permissions. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
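The per-node detection idea can be illustrated with the Python sketch below, in which each participant tries candidate commands locally and keeps the first one that works. The PR implements this as a shell snippet in DOCKER_CMD_DETECT; the function names, the candidate list, and the use of `docker info` as the probe are all assumptions of this sketch.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple


def docker_works(cmd: Sequence[str]) -> bool:
    """Return True if `<cmd> info` exits with status 0 on this node."""
    try:
        result = subprocess.run(
            [*cmd, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the probe hung.
        return False
    return result.returncode == 0


def detect_docker_cmd(
    candidates: Sequence[Tuple[str, ...]] = (("docker",), ("sudo", "-n", "docker")),
    probe: Callable[[Sequence[str]], bool] = docker_works,
) -> Optional[List[str]]:
    """Return the first candidate command that passes the probe, else None.

    Intended to run on every node, so nodes with different docker socket
    permissions can each pick their own working command.
    """
    for cmd in candidates:
        if probe(cmd):
            return list(cmd)
    return None
```

The probe is injectable so the selection logic can be exercised without docker installed, for example `detect_docker_cmd(probe=lambda c: c[0] == "sudo")`.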

- Add perf-changelog entries for kimik2.5-fp4-mi355x-vllm-disagg and minimaxm2.5-fp8-mi355x-vllm-disagg to trigger CI benchmarks
- Update kimi 1k1k conc-list from [8] to [16]
- Comment out kimi 8k1k config until eval pipeline is wired up

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Set --served-model-name on all prefill/decode vllm serve commands so the model name matches what run_lm_eval sends in API requests. Also add eval pipeline support (health check, run_eval, artifact staging) mirroring server_sglang.sh. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

bench.sh now uses MODEL_NAME for vllm-disagg to match --served-model-name, and MODEL_PATH for sglang to match its default. Simplified SERVED_MODEL to use MODEL_NAME directly since MODEL env var is not available inside the container. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
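The selection rule described above amounts to a small lookup, sketched here in Python. The function name `request_model_name` and the framework strings used as keys are assumptions of this sketch; bench.sh itself is a shell script.

```python
def request_model_name(framework: str, model_name: str, model_path: str) -> str:
    """Pick the model identifier the benchmark client should send.

    vllm-disagg servers are started with --served-model-name set to the
    short name, while sglang is assumed to serve under its model path.
    """
    if framework == "vllm-disagg":
        return model_name
    if framework == "sglang":
        return model_path
    raise ValueError(f"unknown framework: {framework!r}")
```

If the client sent a name the server was not started with, requests would be rejected, which is the mismatch this commit avoids.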

benchmark_lib.sh rejected unknown flags — add --tokenizer support so vllm-disagg bench can resolve the tokenizer from the local model path instead of attempting an HF download with the short model name. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
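For illustration, here is the same idea expressed with Python's argparse, which also rejects unknown flags by default. This is a sketch and not the benchmark_lib.sh implementation; `parse_bench_args` and the `--model` flag are hypothetical, and falling back to the model value when `--tokenizer` is omitted is an assumption.

```python
import argparse
from typing import List, Optional


def parse_bench_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse benchmark flags; unknown flags cause argparse to exit with an error."""
    parser = argparse.ArgumentParser(description="benchmark argument sketch")
    parser.add_argument("--model", required=True, help="model name sent in requests")
    parser.add_argument(
        "--tokenizer",
        default=None,
        help="tokenizer location, e.g. a local model path (defaults to --model)",
    )
    args = parser.parse_args(argv)
    if args.tokenizer is None:
        # Fall back to the model identifier when no tokenizer is given.
        args.tokenizer = args.model
    return args
```

With an explicit `--tokenizer` pointing at a local path, the short model name can still be used in requests without triggering a tokenizer download by that name.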

ichbinblau and others added 30 commits May 13, 2026 13:53

remove vllm disagg for dpsr1 and dpv3

f365372

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

consolidate amd_utils for sglang and vllm

08f4c5b

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

use vLLM router as default router for vllm disagg

98ad4f3

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

fix bugs

1dbaad8

Signed-off-by: Chun Fang <chun.fang@amd.com>

[AMD] Bump to nightly vllm and vllm-router images (#1208)

ee133d7

--------- Signed-off-by: Simon Danielsson <pedaniel@amd.com>

update vllm image and vllm router image

4940153

update the interface prefix for tw cluster

d100454

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

add deps for ib device auto-detection

2fa7ee3

Signed-off-by: Shan Theresa <theresa.shan@amd.com>

update vllm image

9115482

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

fix indentation and add missing finally block in async_request_openai…

784a5a0

…_chat_completions Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

fix tw-eth interface detection pattern in env.sh

d4e1daf

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

fix vllm-disagg config schema: use scenarios.fixed-seq-len

e2d3a28

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

fix vllm-disagg routing to multi_node benchmark subdir

83e7554

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

suppress tokenizer warnings and debug output in bench.sh

51c92a7

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

reduce vllm-disagg concurrency sweep to single point for faster itera…

73d649a

…tion Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

enable set -x around docker privilege detection for CI debugging

0454199

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

switch vllm-disagg to 8k1k config to trigger multi-node eval

5959f8d

Comment out 1k1k config and enable 8k1k with conc-list [16] so mark_eval_entries picks it up for the eval pipeline. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

add multi-node eval feature

f479f0d

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

remove start_etcd.sh

7f80da7

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

change decode to 1, easier for testing

0238ad1

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

Initial commit

7240dcf

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

feat: add configs for minimax on mi300 and mi325

41b2fc5

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

ichbinblau and others added 23 commits May 19, 2026 14:37

add token patch to bench for vllm

b0f116e

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

update vllm image for kimi2.5 and Minimax disagg.

a99c4f6

Signed-off-by: Shan Theresa <theresa.shan@amd.com>

Update setup_deps.sh

8a24fa6

Update amd-master.yaml

1967919

restore the kimi k2.5 settings

update req rate for vllm.

7987452

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

make the sglang env consistent with upstream

b9df2a0

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

node blacklist

7925efd

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

Merge remote-tracking branch 'upstream/amd/vllm_disagg_mvp_dev_th' in…

e3d7c16

…to amd/vllm_disagg_minimax_fp8_cdna3 Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: re-add MORI_IO_TC envvars

abdbff6

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: add excluded nodes in MI325 cluster

28ae46b

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: update conc list and use 2p1d for 8k/1k high conc

80a5b37

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: set MORI-related envvars for vllm same as sgl

d383560

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: update exluded node now when more are down

3d962a7

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: update excluded mi300 nodes

2b99dcb

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: remove sudo from rm commands in mi325 runner

bbea3a7

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: update mi325 model path

ae7dca4

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: use a more random port than 5000 for initial container creation …

a7ae751

…barrier Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: add backup docker command if docker and sudo docker does not work

97c34d1

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: docker backup fix

95ac360

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: remove manual install of older libbnxt-re version

1c7e81c

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: preserve mi300x multinode launch diagnostics

e6ab686

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

github-project-automation Bot added this to InferenceMAX Board Jun 28, 2026

haic0 and others added 6 commits June 28, 2026 14:33

fix: use valid mi300x slurm excludes

650edf3

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fix: choose mi300x nodes with workspace access

aae9b41

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fix: stage mi300x multinode workspace

e49e6b5

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fix: align mi300x multinode cleanup

8e0bf53

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fix: wait for mi300x staging nodes

39716d7

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fix: probe all mi300x nodes for staging

ce883b4

Signed-off-by: haic0 <haic0@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

haic0 wants to merge 98 commits into main from amd/vllm_disagg_minimax_fp8_cdna3

3 participants
