[DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 823189e by seungrokj · Pull Request #1709 · SemiAnalysisAI/InferenceX

seungrokj · 2026-06-11T06:08:14Z

Summary

Add qwen3.5-fp4-mi355x-sglang-agentic-hicache config: SGLang agentic-coding sweep with and without hicache offloading (TP2, EP1)
Add minimaxm2.5-fp4-mi355x-vllm-agentic-lmcache config: vLLM agentic-coding sweep with lmcache
Add new agentic benchmark scripts: minimaxm2.5_fp4_mi355x.sh, qwen3.5_fp4_mi355x.sh
Update existing agentic scripts: glm5.1_fp4_mi355x.sh, kimik2.5_fp4_mi355x.sh, minimaxm2.5_fp8_mi355x.sh, qwen3.5_fp8_mi355x.sh
Update launch_mi355x-amds.sh

Test plan

Verify hicache/lmcache agentic configs run correctly on MI355X
Confirm new agentic scripts launch without errors

🤖 Generated with Claude Code

Note

Medium Risk
Large CI config and container pin churn affects many GPU sweeps; PD disagg plus HiCache/Mooncake and nightly vLLM images add operational and regression risk on shared clusters.

Overview
Expands AMD MI355X benchmark coverage for agentic-coding and DeepSeek-V4 while rebasing image pins and keeping several fixed-seq-len entries aligned with origin/main via separate -agentic / -agentic-hicache YAML keys.

amd-master.yaml bumps SGLang/vLLM/Atom images, simplifies some Qwen3.5 fixed-seq search spaces, and adds net-new scenarios: DSv4 FP4 PD-disagg (dsv4-fp4-mi355x-sglang-disagg), agentic sweeps with CPU / HiCache / LMCache offload, and relocated agentic-only recipes (Kimi, MiniMax, GLM, Qwen) so main-line throughput configs stay unchanged. Some vLLM pins move to v0.21.0 or nightlies for ROCm KV offload; one top-level agentic block is commented out to limit sweep cost during image validation.

CI passes offloading from matrix config into e2e/sweep workflows and surfaces it in multinode job names.

Harness/runtime: default agentic trace loaders switch to 061526 corpora; new multi-node agentic launchers for DSR1/DSV4 disagg drive HiCache/Mooncake env. amd_utils gains HiCache config via bind-mounted env, trace_replay.sh for PD agentic runs, DeepSeek-V4-Pro in models.yaml, decoupled EP vs DP CLI flags, DSv4 bench --dsv4 framing, router circuit-breaker tweaks, and SGLang startup patches (disagg bootstrap desync, optional host-pool assert). Single-node agentic scripts add or extend DSv4/GLM HiCache and ATOM LMCache paths.

^{Reviewed by Cursor Bugbot for commit a493908. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-06-11T06:08:23Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor · 2026-06-11T06:09:45Z

+    $ASYNC_SCHEDULING_ARGS 
+    "${PREFIX_CACHE_ARGS[@]}"
+    "${OFFLOAD_ARGS[@]}"
+)


vLLM uses wrong model

High Severity

The vLLM command serves "$MODEL" and omits --served-model-name, while the script downloads weights into MODEL_PATH and build_replay_cmd sends --model $MODEL to aiperf. That breaks the usual MODEL_PATH + served-name pairing used by sibling agentic scripts and can fail when MODEL is a Hub id but weights live under MODEL_PATH.

^{Reviewed by Cursor Bugbot for commit 01cc2af. Configure here.}
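
A minimal sketch of the MODEL_PATH + served-name pairing this finding refers to, assuming `MODEL` holds the Hub id and `MODEL_PATH` optionally holds a pre-staged weights directory. `resolve_weights` and `print_vllm_serve_cmd` are hypothetical helpers, not functions from this repository, and the sketch only prints a command line instead of launching a server.

```shell
# Hypothetical helper: prefer the pre-staged MODEL_PATH when it is set,
# otherwise fall back to the Hub id in MODEL.
resolve_weights() {
    local model="$1" model_path="${2:-}"
    if [ -n "$model_path" ]; then
        printf '%s\n' "$model_path"
    else
        printf '%s\n' "$model"
    fi
}

# Print (do not run) a vLLM command that loads weights from the resolved
# location while keeping the Hub id as the OpenAI-facing model name, so a
# client that sends --model "$MODEL" still matches the served name.
print_vllm_serve_cmd() {
    local model="$1" weights
    weights="$(resolve_weights "$model" "${2:-}")"
    printf 'vllm serve %s --served-model-name %s\n' "$weights" "$model"
}
```

Called as `print_vllm_serve_cmd "$MODEL" "${MODEL_PATH:-}"`, this keeps the name the replay client requests and the name the server advertises identical whether or not a local directory was staged.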

cursor · 2026-06-11T06:09:45Z

    --mem-fraction-static 0.8 \
-    --context-length $MAX_MODEL_LEN \
+    "${CACHE_ARGS[@]}" \
+    "${WARMUP_ARGS[@]}" \


SGLang ignores MODEL_PATH

Medium Severity

SGLang is started with --model-path $MODEL and no --served-model-name, after the script may download into MODEL_PATH. Matrix jobs that set a local MODEL_PATH can still point the server at the Hub id, and the OpenAI model name may not match MODEL used by aiperf.

Additional Locations (1)

benchmarks/single_node/agentic/qwen3.5_fp4_mi355x.sh#L123-L141

^{Reviewed by Cursor Bugbot for commit 01cc2af. Configure here.}

cursor · 2026-06-11T06:09:45Z

+        cd LMCache
+        pip install -r requirements/build.txt 
+        CXX=hipcc BUILD_WITH_HIP=1 pip install -e .   --no-build-isolation
+        cd ..


LMCache clone not idempotent

Medium Severity

The lmcache path runs git clone https://github.com/LMCache/LMCache.git unconditionally. With set -e, a second run in the same working directory exits when LMCache already exists, so lmcache agentic jobs fail on retry or reuse of the job cwd.

Additional Locations (1)

benchmarks/single_node/agentic/minimaxm2.5_fp4_mi355x.sh#L149-L154

^{Reviewed by Cursor Bugbot for commit 01cc2af. Configure here.}
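
One possible guard for the retry case described here, sketched with a hypothetical `clone_once` helper that is not part of the PR. It assumes that an existing `.git` directory means the earlier clone completed and can be reused.

```shell
# Hypothetical helper: clone only when the checkout is missing, so a second
# run in the same working directory does not abort under `set -e`.
clone_once() {
    local url="$1" dir="$2"
    if [ -d "$dir/.git" ]; then
        echo "reusing existing checkout: $dir"
    else
        git clone "$url" "$dir"
    fi
}
```

In the lmcache branch this would be invoked as `clone_once https://github.com/LMCache/LMCache.git LMCache` before the `cd LMCache` step. A half-finished clone would still need manual cleanup, which this sketch does not attempt.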

cursor · 2026-06-12T06:15:23Z

+
+python3 -m sglang.launch_server \
+    --attention-backend aiter \
+    --model-path $MODEL \


Server ignores MODEL_PATH

Medium Severity

Weights are downloaded into MODEL_PATH when the workflow sets that directory, but SGLang is started with --model-path $MODEL (Hub id) instead of MODEL_PATH. The server may load a different cache path than the one prepared for the job.

^{Reviewed by Cursor Bugbot for commit 32f5007. Configure here.}

cursor · 2026-06-12T08:06:30Z

+
+# ---- Resolve traces and install deps ----------------------------------------
+# https://huggingface.co/datasets/semianalysisai/cc-traces-weka-with-subagents-060826
+export WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_060826


DSv4 atom uncapped traces

Medium Severity

This new DSv4 ATOM agentic script sets WEKA_LOADER_OVERRIDE to the uncapped 060826 trace set, while peer MI355X agentic scripts in the same PR use 060226_256k to avoid ~1M-token traces that are rejected and skew sweeps.

^{Reviewed by Cursor Bugbot for commit 351e729. Configure here.}

cursor · 2026-06-12T08:45:54Z

+    $ASYNC_SCHEDULING_ARGS 
+    "${PREFIX_CACHE_ARGS[@]}"
+    "${OFFLOAD_ARGS[@]}"
+)


MiniMax FP8 launcher regressed

High Severity

The MI355X MiniMax FP8 agentic launcher was replaced with a Kimi-style vLLM recipe. Existing minimaxm2.5-fp8-mi355x-vllm-agentic jobs (TP4/EP4, offloading=cpu) lose the prior --max-model-len, ROCM_AITER_UNIFIED_ATTN backend, MODEL_PATH-based serve, and SimpleCPU offload wiring they depended on.

^{Reviewed by Cursor Bugbot for commit faba18f. Configure here.}

cursor · 2026-06-15T00:22:06Z

-    --cuda-graph-max-bs "$PER_ENGINE_MAX_RUNNING" \
+    --disable-radix-cache \
+    --attention-backend dsv4 \
+    --max-running-requests ${CONC} \


DP max-running requests wrong

Medium Severity

When DP_ATTENTION=true, the script computes PER_ENGINE_MAX_RUNNING as CONC/TP for per-engine limits, but the server is started with --max-running-requests ${CONC}. Each DP engine may accept too many sequences versus the harness load-balancing assumption.

^{Reviewed by Cursor Bugbot for commit 76d90e0. Configure here.}
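
The mismatch can be illustrated with a small sketch. `max_running_for` is a hypothetical helper; the CONC/TP split follows the finding's description, and the floor of 1 is an added assumption so that small concurrencies never produce a zero limit.

```shell
# Hypothetical helper: pick the value to pass to --max-running-requests.
# With DP attention each engine gets CONC / TP (at least 1, an assumption);
# otherwise the full CONC applies to the single engine.
max_running_for() {
    local conc="$1" tp="$2" dp_attention="$3"
    local per_engine
    if [ "$dp_attention" = "true" ]; then
        per_engine=$(( conc / tp ))
        if [ "$per_engine" -lt 1 ]; then
            per_engine=1
        fi
        echo "$per_engine"
    else
        echo "$conc"
    fi
}
```

For example, `max_running_for 64 8 true` yields 8 while `max_running_for 64 8 false` yields 64, which is the difference between the computed per-engine limit and the value the server is actually started with.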

cursor · 2026-06-15T02:56:39Z

-python3 -m sglang.launch_server \
-    --model-path "$MODEL_PATH" --served-model-name "$MODEL" \
+sglang serve \
+    --model-path $MODEL \


Wrong model path for serve

Medium Severity

The script downloads weights into MODEL_PATH when set, but sglang serve uses --model-path $MODEL (Hub id) instead of "$MODEL_PATH". Runs that pre-stage a local directory can ignore the prepared path and rely on a different cache location.

^{Reviewed by Cursor Bugbot for commit 4ebc4e2. Configure here.}

cursor · 2026-06-15T15:47:50Z

-  multinode: false
+  framework: sglang-disagg
+  multinode: true
+  disagg: true


Disagg agentic uses wrong runner

High Severity

dsr1-fp4-mi355x-sglang-disagg-agentic-hicache sets runner: mi355x while sibling PD-disagg entries (including dsv4-fp4-mi355x-sglang-disagg-agentic-hicache) use runner: mi355x-disagg. Multinode jobs use runs-on: ${{ inputs.runner }}, so the DSR1 agentic disagg matrix likely schedules on the wrong runner class.

Additional Locations (1)

.github/configs/amd-master.yaml#L2848-L2857

^{Reviewed by Cursor Bugbot for commit a9e1304. Configure here.}

cursor · 2026-06-15T15:47:50Z

+        # node budget. Lower TP configs use higher ratios to maintain adequate
+        # host token capacity without exceeding DRAM limits.
+        if [ "$TP" -ge 8 ]; then
+            DEFAULT_HICACHE_RATIO=2


DSv4 HiCache ratio inconsistent

Low Severity

Both DSv4 MI355X agentic SGLang launchers share the same HiCache comment for TP≥8, but dsv4_fp4_mi355x_sglang.sh defaults DEFAULT_HICACHE_RATIO to 2 while dsv4_fp4_mi355x.sh uses 8, so identical YAML sweeps get different CPU tier sizing.

Additional Locations (1)

benchmarks/single_node/agentic/dsv4_fp4_mi355x.sh#L73-L75

^{Reviewed by Cursor Bugbot for commit a9e1304. Configure here.}

cursor · 2026-06-15T15:47:50Z

-        TOTAL_CPU_DRAM_GB=2500
+        #TODO: fix
+        TOTAL_CPU_DRAM_GB=3000
+        TOTAL_CPU_DRAM_PARTITION_GB="${TOTAL_CPU_DRAM_PARTITION_GB:-$((TOTAL_CPU_DRAM_GB / (8 / TP)))}"


CPU offload divide by zero

Medium Severity

TOTAL_CPU_DRAM_PARTITION_GB uses $((TOTAL_CPU_DRAM_GB / (8 / TP))). For TP greater than 8, bash evaluates 8 / TP as 0 and arithmetic expansion errors on divide-by-zero.

Additional Locations (1)

benchmarks/single_node/agentic/minimaxm2.5_fp8_mi355x.sh#L124-L126

^{Reviewed by Cursor Bugbot for commit a9e1304. Configure here.}
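
A sketch of one way to avoid the zero divisor: multiply before dividing and cap the result at the node total. `cpu_dram_partition_gb` is a hypothetical helper, and the default of 8 GPUs per node is taken from the constant in the original expression. For TP values that divide 8 the result matches the original formula; for other values it differs slightly because the original truncates `8 / TP` first.

```shell
# Hypothetical helper: per-job share of host DRAM, proportional to TP.
# total * tp / gpus never divides by zero for tp >= 1, unlike
# total / (gpus / tp), where the inner quotient truncates to 0 once tp > gpus.
cpu_dram_partition_gb() {
    local total_gb="$1" tp="$2" gpus_per_node="${3:-8}"
    local share=$(( total_gb * tp / gpus_per_node ))
    # Never hand out more than the node actually has.
    if [ "$share" -gt "$total_gb" ]; then
        share="$total_gb"
    fi
    echo "$share"
}
```

With a 3000 GB total, TP=2 gives 750 and TP=8 gives 3000, as before, and TP=16 is capped at 3000 instead of raising an arithmetic error.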

cursor · 2026-06-16T06:58:57Z

    decode_dp_ranks=$DECODE_TP_SIZE
    MORI_MAX_DISPATCH_TOKENS_DECODE=$((BENCH_MAX_CONC_VALUE / decode_dp_ranks))
-    MORI_MOE_MAX_INPUT_TOKENS_DECODE=$((MORI_MAX_DISPATCH_TOKENS_DECODE * decode_dp_ranks * 7 / 10))
+    # MORI_MOE_MAX_INPUT_TOKENS_DECODE=$((MORI_MAX_DISPATCH_TOKENS_DECODE * decode_dp_ranks * 7 / 10))


Disagg MoE token overrides removed

Medium Severity

This change comments out assignments that set MORI_MOE_MAX_INPUT_TOKENS_PREFILL and MORI_MOE_MAX_INPUT_TOKENS_DECODE for DP+EP and MTP decode paths, while launch commands still conditionally export those variables. Disagg sweeps that relied on the computed caps may run with unset MoE input limits.

^{Reviewed by Cursor Bugbot for commit c21ad06. Configure here.}

cursor · 2026-06-16T13:58:08Z

    fi
    set +x
-    PREFILL_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_PREFILL} ${PREFILL_SDMA_ENV} ${PREFILL_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_MAX_DISPATCH_TOKENS_PREFILL} python3 -m sglang.launch_server \
+    PREFILL_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_PREFILL} ${PREFILL_SDMA_ENV} ${PREFILL_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK_PREFILL:-${MORI_MAX_DISPATCH_TOKENS_PREFILL}} MORI_IO_SQ_BACKOFF_TIMEOUT_US=${MORI_IO_SQ_BACKOFF_TIMEOUT_US} MORI_IO_QP_MAX_SEND_WR=${MORI_IO_QP_MAX_SEND_WR} ${LAUNCH_PREFIX:-} python3 -m sglang.launch_server \


Server ignores resolved MODEL_PATH

Medium Severity

job.slurm now resolves and exports a canonical MODEL_PATH (caller path, hf_dir, or MODEL_DIR/MODEL_NAME), but server_sglang.sh still launches with --model-path $MODEL_DIR/$MODEL_NAME. When the resolved path differs from that join, prefill/decode can fail to load weights or load from the wrong directory.

Additional Locations (1)

benchmarks/multi_node/amd_utils/job.slurm#L209-L242

^{Reviewed by Cursor Bugbot for commit c7f269e. Configure here.}

cursor · 2026-06-17T00:23:37Z

    fi
    set +x
-    DECODE_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_DECODE} ${DECODE_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_MAX_DISPATCH_TOKENS_DECODE} python3 -m sglang.launch_server \
+    DECODE_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_DECODE} ${DECODE_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK_DECODE:-${MORI_MAX_DISPATCH_TOKENS_DECODE}} MORI_IO_SQ_BACKOFF_TIMEOUT_US=${MORI_IO_SQ_BACKOFF_TIMEOUT_US} MORI_IO_QP_MAX_SEND_WR=${MORI_IO_QP_MAX_SEND_WR} ${LAUNCH_PREFIX:-} python3 -m sglang.launch_server \


Custom all-reduce flag unused

Medium Severity

DISABLE_CUSTOM_ALL_REDUCE is threaded into the container from job.slurm, and the DSR1 disagg agentic recipe defaults it to 1 for an Aiter fault workaround, but prefill/decode launch commands never append --disable-custom-all-reduce.

^{Reviewed by Cursor Bugbot for commit b5626fb. Configure here.}
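
A minimal sketch of how the environment variable could be translated into the flag named in this finding. `custom_all_reduce_args` is a hypothetical helper; where its output would be spliced into the prefill and decode command strings is left to the launcher.

```shell
# Hypothetical helper: emit the SGLang flag only when the toggle is "1".
# Prints nothing (and still succeeds) for any other value or when unset.
custom_all_reduce_args() {
    local toggle="${1:-0}"
    if [ "$toggle" = "1" ]; then
        printf '%s' "--disable-custom-all-reduce"
    fi
}
```

The launcher would call `custom_all_reduce_args "${DISABLE_CUSTOM_ALL_REDUCE:-0}"` and append the result to both the prefill and decode server arguments, so the recipe default of 1 actually reaches the server.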

cursor · 2026-06-19T04:20:53Z

        # LMCache backend.
-        TOTAL_CPU_DRAM_GB=2500
+        #TODO: fix
+        TOTAL_CPU_DRAM_GB=3000


Ignores TOTAL_CPU_DRAM_GB env

Medium Severity

The scripts require TOTAL_CPU_DRAM_GB via check_env_vars, then hardcode TOTAL_CPU_DRAM_GB=3000 inside the cpu and lmcache branches. Workflow-supplied offload memory sizing is discarded and partition math uses the fixed constant instead.

Additional Locations (1)

benchmarks/single_node/agentic/dsv4_fp4_mi355x_atom.sh#L113-L155

^{Reviewed by Cursor Bugbot for commit 73756ab. Configure here.}
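
One way to keep a script-level fallback without discarding the workflow value, sketched with a hypothetical `resolve_total_cpu_dram_gb` helper. It assumes the 3000 GB constant is meant only as a default.

```shell
# Hypothetical helper: honor a caller-supplied TOTAL_CPU_DRAM_GB and use the
# script's constant only when the variable is unset or empty.
resolve_total_cpu_dram_gb() {
    local fallback_gb="$1"
    echo "${TOTAL_CPU_DRAM_GB:-$fallback_gb}"
}
```

Inside the cpu and lmcache branches, `TOTAL_CPU_DRAM_GB="$(resolve_total_cpu_dram_gb 3000)"` would then feed the partition math with the workflow's sizing whenever one is provided.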

cursor · 2026-06-23T03:34:25Z

+mkdir -p "$RESULT_DIR"
+
+export WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_061526
+resolve_trace_source


Disagg trace corpus not 256k capped

Medium Severity

Multinode agentic replay hardcodes WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_061526 (full corpus). Single-node agentic scripts use _256k variants for non-DSv4 models, while benchmark_lib.sh now defaults non-dsv4 recipes to 061526_256k. Disagg agentic jobs can load traces that exceed max_model_len and fail or skew results.

^{Reviewed by Cursor Bugbot for commit 219bcc9. Configure here.}
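
A sketch of selecting the capped corpus per model family instead of hardcoding the full one. `select_trace_loader` is a hypothetical helper, and both the `_256k` suffix convention and the `dsv4*` family prefix are assumptions inferred from the names quoted in this finding, not confirmed loader identifiers.

```shell
# Hypothetical helper: DSv4 recipes keep the full corpus; every other model
# family gets the 256k-capped variant (suffix convention is an assumption).
select_trace_loader() {
    local base="$1" model_family="$2"
    case "$model_family" in
        dsv4*) echo "$base" ;;
        *)     echo "${base}_256k" ;;
    esac
}
```

A multinode launcher could then export `WEKA_LOADER_OVERRIDE="$(select_trace_loader semianalysis_cc_traces_weka_061526 dsr1)"` so that disagg jobs follow the same capping rule as the single-node scripts.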

cursor · 2026-06-23T03:48:33Z

+
+if [ "$ANY_FAILED" -ne 0 ]; then
+    echo "WARNING: at least one conc had a non-zero exit; per-conc result files were still written when possible." >&2
+fi


Replay failures do not fail job

Medium Severity

When any concurrency sweep in trace_replay.sh fails, the script only prints a warning and exits 0, so SLURM and CI can treat a broken agentic disagg run as success without a valid aggregate result.

^{Reviewed by Cursor Bugbot for commit 7c2bcbc. Configure here.}
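
A minimal sketch of propagating the failure instead of only warning. `finish_replay` is a hypothetical helper; it assumes `ANY_FAILED` is an integer that is non-zero when at least one concurrency run failed, as in the diff above.

```shell
# Hypothetical helper: turn the accumulated failure flag into an exit status
# so SLURM and CI see the run as failed.
finish_replay() {
    local any_failed="$1"
    if [ "$any_failed" -ne 0 ]; then
        echo "ERROR: at least one conc had a non-zero exit; failing the job." >&2
        return 1
    fi
    return 0
}
```

Ending the script with `finish_replay "$ANY_FAILED" || exit 1` keeps the per-conc result files that were already written while still marking the job as failed.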

cursor · 2026-06-23T03:48:33Z

+        --gpu-memory-utilization 0.85 \
+    "${PREFIX_CACHE_ARGS[@]}"
+    "${OFFLOAD_ARGS[@]}"
+)


DSv4 atom script is Kimi

High Severity

The added dsv4_fp4_mi355x_atom.sh is a Kimi-K2.5 ATOM/vLLM offload recipe (comments, LMCache clone, --kv_offloading_backend, atom.entrypoints.openai_server) and does not implement DeepSeek-V4-Pro ATOM serving despite the filename and PR intent.

^{Reviewed by Cursor Bugbot for commit 7c2bcbc. Configure here.}

seungrokj requested review from 1am9trash, billishyahao, chunfangamd and yctseng0211 as code owners June 11, 2026 06:08

github-project-automation Bot added this to InferenceMAX Board Jun 11, 2026

seungrokj mentioned this pull request Jun 11, 2026

[DNM][AMD] agentx-v0.4 #1654

Closed

cursor Bot reviewed Jun 11, 2026

seungrokj changed the title ~~[AMD] agentic: add hicache/lmcache configs, update agentic scripts for mi355x models~~ [DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61 Jun 11, 2026

cursor Bot reviewed Jun 12, 2026

Comment thread benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh Outdated

cursor Bot reviewed Jun 14, 2026

Comment thread benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_sglang.sh Outdated

Comment thread benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_sglang.sh Outdated

cursor Bot reviewed Jun 15, 2026

ichbinblau force-pushed the amd/agentx-v0.4_rebase0611 branch from 067484f to a9e1304 Compare June 15, 2026 15:35

cursor Bot reviewed Jun 15, 2026

cursor Bot reviewed Jun 16, 2026

ichbinblau force-pushed the amd/agentx-v0.4_rebase0611 branch from c7f269e to e37fbc2 Compare June 16, 2026 14:35

cursor Bot reviewed Jun 17, 2026

Comment thread .github/configs/amd-master.yaml

cursor Bot reviewed Jun 17, 2026

Comment thread benchmarks/multi_node/agentic/dsr1_fp4_mi355x_sglang-disagg.sh

cursor Bot reviewed Jun 18, 2026

Comment thread benchmarks/single_node/agentic/glm5.1_fp4_mi355x.sh

seungrokj changed the title ~~[DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61~~ [DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 823189e Jun 19, 2026

cursor Bot reviewed Jun 19, 2026

cursor Bot reviewed Jun 23, 2026

seungrokj and others added 30 commits June 28, 2026 14:54

[AMD] increase dsv4-fp4-mi355x-vllm-agentic-lmcache duration to 10800s

35cccc5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] switch dsv4-fp4-mi355x-vllm-agentic-lmcache to cpu offloading d…

1381a92

…ebug run (conc 32, 1800s) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] clear mooncake device_name for dsv4-fp4-mi355x vllm lmcache

d393545

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] reduce total-cpu-dram-gb to 600 for dsv4-fp4-mi355x cpu offload…

307855a

… debug Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] reduce mooncake local_buffer_size to 200MB to fix RDMA ENOMEM o…

9138294

…n mi355x Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] switch mooncake protocol to tcp to bypass RDMA issues on mi355x

b5eb1f2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] increase MOONCAKE_KV_LEASE_TTL to 3600s and add kv_load_failure…

984cee1

…_policy=ignore for dsv4-fp4-mi355x Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] revert kv_load_failure_policy=ignore from MooncakeStoreConnecto…

7e9c55a

…r config for dsv4-fp4-mi355x Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

update con

01cee2a

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

[AMD] MINIMAX-M3 FP4 VLLM LMCACHE AGENTIC

f8e5eb2

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

Merge branch 'amd/agentx-v0.4_rebase0611' of https://github.com/SemiA…

8e84e0f

…nalysisAI/InferenceX into amd/agentx-v0.4_rebase0611

[AMD] MINIMAX-M3 FP4 VLLM LMCACHE AGENTIC

d1fba1e

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

[AMD] MINIMAX-M3 FP4 VLLM LMCACHE AGENTIC

47e5ff3

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

[AMD] increase Mooncake local_buffer_size to 4GB, MC_WORKERS_PER_CTX …

18c80c3

…to 8, and total-cpu-dram-gb to 1500 for dsv4-fp4-mi355x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] MINIMAX-M3 FP4 VLLM LMCACHE AGENTIC

36b63b3

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

Merge branch 'amd/agentx-v0.4_rebase0611' of https://github.com/SemiA…

fdcb3d4

…nalysisAI/InferenceX into amd/agentx-v0.4_rebase0611

[AMD] install RDMA/libcurl deps before mooncake for dsv4-fp4-mi355x c…

94b2e43

…pu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] build mooncake from source with RDMA support and switch protoco…

e603e80

…l to rdma for dsv4-fp4-mi355x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] reduce Mooncake local_buffer_size to 2GB to stay within RDMA ma…

464547b

…x_mr_size limit on mi355x Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] MINIMAX-M3 FP4 VLLM LMCACHE AGENTIC

73cd5ad

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

[AMD] switch Mooncake protocol back to tcp to avoid RDMA ENOMEM on mi…

963776a

…355x Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] disable patch_vllm_mooncake_transfer_batches and switch back to…

56d9f2e

… rdma protocol for dsv4-fp4-mi355x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] add --ulimit memlock=-1 to srun for mi355x to allow RDMA memory…

75ff18c

… registration Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

add more cache related metrics

b4a1d8a

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

[AMD] switch Mooncake back to tcp and remove --ulimit memlock=-1 (mem…

fe283b8

…lock already 378GB) for dsv4-fp4-mi355x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] set VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 and MC_ENABLE_DE…

a6406af

…ST_DEVICE_AFFINITY=1 for dsv4-fp4-mi355x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] increase Mooncake local_buffer_size to 4GB for tcp mode on dsv4…

cd1d383

…-fp4-mi355x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] re-enable patch_vllm_mooncake_transfer_batches for dsv4-fp4-mi3…

c7fdde3

…55x cpu offload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

remove con=40

3c5dda2

Signed-off-by: Theresa Shan <theresa.shan@amd.com>

[AMD] reduce local_buffer_size to 2GB and add sleep 10 after mooncake…

4263181

…_master start for segment registration on mi355x Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
