Amd/vllm disagg minimax fp8 cdna3#1949

Draft
haic0 wants to merge 98 commits into main from amd/vllm_disagg_minimax_fp8_cdna3
haic0 (Collaborator) commented Jun 28, 2026

No description provided.

ichbinblau and others added 30 commits May 13, 2026 13:53
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Chun Fang <chun.fang@amd.com>
---------

Signed-off-by:  Simon Danielsson <pedaniel@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Shan Theresa <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
…_chat_completions

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The inline collect_latest_results.py hardcoded "sglang" as the log
directory prefix, causing "No logs directory found" for vllm-disagg
runs where bench.sh creates directories named vllm-disagg_isl_X_osl_Y.
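
The fix described above amounts to deriving the log directory prefix from the framework name rather than fixing it to "sglang". The real script is Python and its code is not shown here, so the following is only a shell sketch of the idea. The helper name `find_log_dirs`, the directory layout, and the ISL/OSL values in the demo are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): list log directories for one framework
# by matching <framework>_isl_X_osl_Y instead of hardcoding "sglang".
find_log_dirs() {
    local logs_root="$1"   # directory holding per-run log directories
    local framework="$2"   # e.g. "sglang" or "vllm-disagg"
    local dir
    for dir in "$logs_root"/"$framework"_isl_*_osl_*; do
        # An unmatched glob stays literal, so confirm the directory exists.
        [ -d "$dir" ] && basename "$dir"
    done
    return 0
}

# Demo on a throwaway tree; the ISL/OSL numbers are illustrative only.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/vllm-disagg_isl_1024_osl_1024" \
         "$demo_root/sglang_isl_8192_osl_1024"
vllm_dirs="$(find_log_dirs "$demo_root" "vllm-disagg")"
sglang_dirs="$(find_log_dirs "$demo_root" "sglang")"
echo "$vllm_dirs"
rm -rf "$demo_root"
```

With a hardcoded "sglang" prefix, the `vllm-disagg` lookup would return nothing, which matches the "No logs directory found" symptom.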

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The vllm-router runs as a separate container on node 0. After node 0's
main container finishes the benchmark and exits, decode nodes remain
stuck waiting for the router port to close. The router cleanup in
job.slurm can't run until srun completes, but srun can't complete
because decode nodes are blocked, so the job deadlocks.

Fix: skip exec on rank 0 for vllm-disagg so the srun bash script
continues after docker exits and can stop the router container,
allowing decode nodes to detect the port closure and exit.
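
A rough sketch of the launch decision described above, assuming the srun script knows the framework and the node rank. The function names `launch_mode` and `run_main` and the placeholder cleanup message are hypothetical. The actual job.slurm code is not shown in this PR page.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether the main command replaces the shell (exec)
# or runs in the foreground so cleanup code after it can still execute.
launch_mode() {
    local framework="$1" rank="$2"
    if [ "$framework" = "vllm-disagg" ] && [ "$rank" -eq 0 ]; then
        echo "foreground"   # keep the script alive to stop the router afterwards
    else
        echo "exec"         # nothing to clean up, so replace the shell
    fi
}

run_main() {
    local mode rc
    mode="$(launch_mode "$1" "$2")"
    shift 2
    if [ "$mode" = "exec" ]; then
        exec "$@"
    fi
    "$@"
    rc=$?
    # Placeholder for stopping the router container (the real command is not shown).
    echo "router cleanup would run here"
    return "$rc"
}

# Demo: `true` stands in for the docker invocation.
rank0_output="$(run_main vllm-disagg 0 true)"
echo "$rank0_output"
```

Because `exec` never returns, any cleanup placed after it is unreachable. Running the command in the foreground on rank 0 is what lets the script continue and stop the router.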

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The EXIT trap deleted benchmark_logs/ before saving artifacts, making
it impossible to debug container startup failures. Now the trap always
copies slurm .out/.err to the artifact directory and prints the last
100 lines of .err inline in the CI output.
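
The ordering described above (save first, delete afterwards) can be sketched as an EXIT trap. The function name `save_and_cleanup`, the file names, and the assumption that the slurm `.out`/`.err` files sit inside the log directory are illustrative, not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: copy slurm logs to the artifact directory and print the
# tail of each .err file BEFORE removing the log directory.
save_and_cleanup() {
    local log_dir="$1" artifact_dir="$2" f
    mkdir -p "$artifact_dir"
    for f in "$log_dir"/*.out "$log_dir"/*.err; do
        [ -f "$f" ] && cp "$f" "$artifact_dir"/
    done
    for f in "$artifact_dir"/*.err; do
        [ -f "$f" ] && { echo "--- last 100 lines of $(basename "$f") ---"; tail -n 100 "$f"; }
    done
    rm -rf "$log_dir"
}

# Demo in a subshell so the trap fires when the subshell exits.
demo="$(mktemp -d)"
mkdir -p "$demo/benchmark_logs"
printf 'container failed to start\n' > "$demo/benchmark_logs/slurm-1.err"
printf 'job output\n' > "$demo/benchmark_logs/slurm-1.out"
trap_output="$(
    trap 'save_and_cleanup "$demo/benchmark_logs" "$demo/artifacts"' EXIT
    exit 0
)"
saved_err="$(cat "$demo/artifacts/slurm-1.err")"
if [ -d "$demo/benchmark_logs" ]; then logs_removed=no; else logs_removed=yes; fi
echo "$trap_output"
rm -rf "$demo"
```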

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The batch host has docker socket permissions but the compute nodes
do not, causing "permission denied" on all srun tasks. Move the
detection after SELECTED_NODES is known and probe via srun.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Export DOCKER_CMD_DETECT as a shell snippet that each srun participant
evaluates locally, instead of testing a single node and assuming all
nodes have the same docker socket permissions.
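
One way to picture the per-node detection is an exported string that each participant passes to `eval`. The contents of the snippet below are an assumption: the PR page does not show what `DOCKER_CMD_DETECT` contains, and the `sudo docker` fallback and the `DOCKER_CMD` variable name are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed snippet contents: use plain docker when the socket is accessible,
# otherwise fall back to sudo (the fallback is a guess, not from the PR).
DOCKER_CMD_DETECT='if docker info >/dev/null 2>&1; then DOCKER_CMD="docker"; else DOCKER_CMD="sudo docker"; fi'
export DOCKER_CMD_DETECT

# Each srun participant would evaluate the snippet locally, roughly:
#   srun bash -c 'eval "$DOCKER_CMD_DETECT"; $DOCKER_CMD ps'
# Here a child shell stands in for one participant.
detected="$(sh -c 'eval "$DOCKER_CMD_DETECT"; printf "%s" "$DOCKER_CMD"')"
echo "this node would use: $detected"
```

Exporting the check as text rather than its result means the answer is computed on the node that will run docker, not on the batch host.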

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add perf-changelog entries for kimik2.5-fp4-mi355x-vllm-disagg and
  minimaxm2.5-fp8-mi355x-vllm-disagg to trigger CI benchmarks
- Update kimi 1k1k conc-list from [8] to [16]
- Comment out kimi 8k1k config until eval pipeline is wired up

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Comment out 1k1k config and enable 8k1k with conc-list [16] so
mark_eval_entries picks it up for the eval pipeline.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Set --served-model-name on all prefill/decode vllm serve commands so
the model name matches what run_lm_eval sends in API requests. Also
add eval pipeline support (health check, run_eval, artifact staging)
mirroring server_sglang.sh.
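
As a small illustration of the first change, `vllm serve` accepts `--served-model-name` to set the name clients use in API requests. The sketch below only prints a command line and does not start a server. The helper name `serve_command` and the demo path and name are hypothetical, and the real commands in the PR carry further prefill/decode options that are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a vllm serve command line with an explicit served name.
# Assumes the path and name contain no spaces (this is display only).
serve_command() {
    local model_path="$1" model_name="$2"
    printf 'vllm serve %s --served-model-name %s\n' "$model_path" "$model_name"
}

serve_line="$(serve_command "/models/demo-model" "demo-model")"
echo "$serve_line"
```

Without the flag, vLLM serves the model under the name passed to `vllm serve` (here the local path), so a client asking for the short name would not match.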

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
bench.sh now uses MODEL_NAME for vllm-disagg to match
--served-model-name, and MODEL_PATH for sglang to match its default.
Simplified SERVED_MODEL to use MODEL_NAME directly since MODEL env
var is not available inside the container.
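
The selection rule in this commit message can be sketched as a small function. The function name `select_bench_model` and the demo values are hypothetical, and rejecting unknown frameworks is an assumption rather than documented bench.sh behavior.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choose the model identifier the benchmark client sends.
select_bench_model() {
    local framework="$1" model_name="$2" model_path="$3"
    case "$framework" in
        vllm-disagg) echo "$model_name" ;;   # matches --served-model-name
        sglang)      echo "$model_path" ;;   # sglang default is the model path
        *) echo "unsupported framework: $framework" >&2; return 1 ;;
    esac
}

vllm_model="$(select_bench_model vllm-disagg demo-model /models/demo-model)"
sglang_model="$(select_bench_model sglang demo-model /models/demo-model)"
echo "$vllm_model $sglang_model"
```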

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
ichbinblau and others added 23 commits May 19, 2026 14:37
bench.sh now uses MODEL_NAME for vllm-disagg to match
--served-model-name, and MODEL_PATH for sglang to match its default.
Simplified SERVED_MODEL to use MODEL_NAME directly since MODEL env
var is not available inside the container.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
benchmark_lib.sh rejected unknown flags. Add --tokenizer support so the
vllm-disagg bench can resolve the tokenizer from the local model path
instead of attempting an HF download with the short model name.
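
A minimal sketch of a strict flag parser that accepts `--tokenizer`, assuming a `while`/`case` loop. The actual parser in benchmark_lib.sh is not shown on this page, so the function name `parse_bench_args`, the `--model` flag, and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: unknown flags are rejected, --tokenizer is accepted.
parse_bench_args() {
    MODEL=""
    TOKENIZER=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --model|--tokenizer)
                # Every flag here takes a value; fail instead of looping forever.
                [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                if [ "$1" = "--model" ]; then MODEL="$2"; else TOKENIZER="$2"; fi
                shift 2
                ;;
            *)
                echo "unknown flag: $1" >&2
                return 1
                ;;
        esac
    done
}

# Demo: short served name for requests, local path for the tokenizer.
parse_bench_args --model demo-model --tokenizer /models/demo-model
echo "model=$MODEL tokenizer=$TOKENIZER"
```

Passing the local path as the tokenizer lets the client load tokenizer files from disk while still sending the short name in requests.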

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: Shan Theresa <theresa.shan@amd.com>
restore the kimi k2.5 settings
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
…to amd/vllm_disagg_minimax_fp8_cdna3

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…barrier

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
haic0 and others added 6 commits June 28, 2026 14:33
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: haic0 <haic0@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Labels

None yet

Projects

Status: No status

3 participants