A systematic framework for studying Mixture-of-Experts (MoE) inference optimization strategies: expert-aware batching, expert placement (co-located vs. distributed), and scaling analysis across single-GPU, multi-GPU, and multi-node configurations.
Targets both AMD MI300X (ROCm) and NVIDIA (CUDA) GPUs using upstream vLLM as the inference engine.
Paper: SIEDS 2026 submission
Mixture-of-Experts models activate only a subset of parameters per token, offering theoretical efficiency gains over dense models. However, deploying MoE models introduces unique bottlenecks:
- Communication overhead from dynamic expert routing (all-to-all dispatching)
- GPU under-utilization from uneven expert activation (load imbalance)
- Irregular memory access patterns that strain HBM bandwidth
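The load-imbalance bottleneck can be illustrated with a toy router. This is a minimal sketch, not the routing code of any model listed here: the logits are random, and the `skew` parameter is a hypothetical knob that biases some experts so they are picked more often. The 8-expert, top-2 setting mirrors Mixtral-8x7B.

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, top_k, skew=1.0, seed=0):
    """Count how many tokens each expert receives under top-k routing.

    Router logits are random. `skew` (a hypothetical knob) sets the spread
    of a fixed per-expert bias, so some experts are systematically preferred.
    """
    rng = random.Random(seed)
    bias = [rng.gauss(0.0, skew) for _ in range(num_experts)]
    counts = Counter()
    for _ in range(num_tokens):
        logits = [b + rng.gauss(0.0, 1.0) for b in bias]
        # Top-k selection: each token goes to its k highest-scoring experts.
        ranked = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)
        counts.update(ranked[:top_k])
    return [counts[e] for e in range(num_experts)]


loads = route_tokens(num_tokens=2048, num_experts=8, top_k=2)
mean_load = sum(loads) / len(loads)
print("tokens per expert:", loads)
print("max/mean load:", round(max(loads) / mean_load, 2))
```

A max/mean ratio above 1 means the busiest expert (and the GPU hosting it) does more work than average while the others wait.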
This project provides a complete experimental framework to:
- Benchmark MoE inference across 8 models spanning 6.9B to 671B parameters
- Profile execution with torch profiler, rocprofv3 (AMD), and Nsight Systems (NVIDIA)
- Compare placement strategies: Tensor Parallelism (TP), Expert Parallelism (EP), and hybrids
- Analyze EP load balancing with Gini coefficient, imbalance ratio, and rebalancing recommendations
- Estimate optimal expert placement using a lightweight CPU-based ML predictor
- Scale experiments from 1 GPU to 4 nodes (32 GPUs)
- Study both autoregressive and diffusion-based MoE architectures
We use upstream vLLM as the inference engine. On AMD MI300X:
| Component | Version | Purpose |
|---|---|---|
| vLLM | >= 0.8.0 | OpenAI-compatible serving with EP support |
| ROCm | >= 6.3.1 | GPU compute platform |
| AITer | Integrated | AMD-optimized MoE/attention kernels |
| hipBLASLt | Integrated | High-performance GEMM |
| rocprofv3 | System | Hardware profiling (kernel traces, memory, HIP API) |
Docker image: rocm/vllm-dev:main (AMD's optimized vLLM build with AITer)
See AMD's vLLM Docker guide.
| Model | Total | Active | Experts | Top-k | Type | Min GPUs |
|---|---|---|---|---|---|---|
| LLaDA-MoE-7B | 7B | 1.4B | 8 | 2 | Diffusion MoE | 1 |
| LLaDA-8B | 8B | 8B | 1 (dense) | - | Diffusion | 1 |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 60 | 4 | Autoregressive MoE | 1 |
| OLMoE-1B-7B | 6.9B | 1.3B | 64 | 8 | Autoregressive MoE | 1 |
| Mixtral-8x7B | 46.7B | 12.9B | 8 | 2 | Autoregressive MoE | 1 |
| Qwen2-57B-A14B | 57.4B | 14B | 64 | 8 | Autoregressive MoE | 1 |
| DBRX | 132B | 36B | 16 | 4 | Autoregressive MoE | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | Autoregressive MoE | 8 |
LLaDA-MoE-7B is the first open-source MoE diffusion LLM, enabling comparison of expert parallelism between autoregressive and diffusion inference paradigms.
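As a rough cross-check of the Min GPUs column, the weight footprint alone gives a lower bound on GPU count. The sketch below assumes 16-bit weights, 192 GB of HBM per GPU (MI300X), and that about 90% of that memory is usable for weights. These are illustrative assumptions, not necessarily how the table or src/placement/strategies.py derives its numbers, and KV cache and activations are ignored.

```python
import math

# Total parameter counts (in billions) copied from the model table.
TOTAL_PARAMS_B = {
    "LLaDA-MoE-7B": 7.0,
    "LLaDA-8B": 8.0,
    "Qwen1.5-MoE-A2.7B": 14.3,
    "OLMoE-1B-7B": 6.9,
    "Mixtral-8x7B": 46.7,
    "Qwen2-57B-A14B": 57.4,
    "DBRX": 132.0,
    "DeepSeek-V3": 671.0,
}


def min_gpus_for_weights(total_params_b, bytes_per_param=2,
                         gpu_memory_gb=192.0, usable_fraction=0.9):
    """Lower bound on GPUs needed just to hold the weights.

    All defaults are assumptions: 16-bit weights, MI300X-sized HBM, and
    a 90% usable fraction. KV cache and activations are not counted.
    """
    weight_gb = total_params_b * bytes_per_param  # 1e9 params * bytes = GB
    return math.ceil(weight_gb / (gpu_memory_gb * usable_fraction))


for name, params_b in TOTAL_PARAMS_B.items():
    print(f"{name:20s} {min_gpus_for_weights(params_b)}")
```

Note that every expert must be resident even though only the active parameters are used per token, which is why the total count, not the active count, drives the estimate.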
optimizing-moe-inference/
├── configs/
│   ├── models.yaml # Model registry (8 models)
│   └── experiments/ # Experiment configurations
│       ├── single_gpu.yaml
│       ├── multi_gpu.yaml
│       └── multi_node.yaml
├── docker/
│   ├── Dockerfile.rocm # AMD ROCm (upstream vLLM)
│   ├── Dockerfile.llada # ROCm image for LLaDA diffusion models
│   ├── Dockerfile.cuda # NVIDIA CUDA
│   └── docker-compose.yaml
├── src/
│   ├── inference/
│   │   ├── llada_engine.py # Custom LLaDA inference (single GPU)
│   │   ├── llada_distributed.py # Multi-GPU LLaDA with RCCL
│   │   ├── expert_parallel.py # EP placement strategies + dispatch
│   │   └── profiler.py # torch.profiler + rocprofv3 hooks
│   ├── benchmark/
│   │   ├── runner.py # Benchmark orchestrator (vLLM serve + bench)
│   │   ├── metrics.py # Metrics parsing (throughput, TTFT, ITL)
│   │   ├── report.py # Rich tables, CSV export
│   │   └── factorial_study.py # Controlled factorial experiment design
│   ├── placement/
│   │   ├── strategies.py # 5 placement strategies + memory estimation
│   │   ├── estimator.py # Placement recommendation engine
│   │   ├── predictor.py # CPU-based ML predictor (RandomForest)
│   │   └── load_balancing.py # EP load balance analysis + optimization
│   ├── profiling/
│   │   ├── torch_profiler.py # Torch profiler (both platforms)
│   │   ├── rocm_profiler.py # rocprofv3 wrapper (AMD)
│   │   └── nvidia_profiler.py # Nsight Systems wrapper (NVIDIA)
│   └── utils/
│       ├── config.py # YAML config loader, GPU vendor detection
│       └── gpu_info.py # GPU info (rocm-smi / nvidia-smi)
├── scripts/
│   ├── download_models.sh # Model downloader (HF_TOKEN from env)
│   ├── run_benchmark.sh # Main entry point (auto-detects GPU)
│   ├── run_llada_benchmarks.sh # LLaDA sweep runner (steps, gen length)
│   ├── run_profiling.sh # Profiling (torch/rocprof/nsight)
│   └── setup_env.sh # Environment setup
├── analysis/
│   ├── plot_results.py # Throughput/latency plots
│   ├── plot_load_balance.py # EP load balance heatmaps
│   └── generate_tables.py # LaTeX tables for paper
├── experiments/ # Per-experiment notes
├── results/ # Output (gitignored)
├── Report1.md # Benchmark results from MI300X experiments
├── requirements.txt
├── setup.py
└── LICENSE # MIT
git clone https://github.com/ananya-mh/optimizing-moe-inference.git
cd optimizing-moe-inference
# Set up environment (auto-detects AMD vs NVIDIA)
source scripts/setup_env.sh
# Set HuggingFace token and model directory
export HF_TOKEN=your_token_here
export MODEL_DIR=/path/to/models # default: ./models

# Single-GPU models (LLaDA-MoE, Qwen-MoE, OLMoE, Mixtral, LLaDA-8B)
bash scripts/download_models.sh single_gpu
# Diffusion LLM models only
bash scripts/download_models.sh diffusion_llm
# All models
bash scripts/download_models.sh all

# Single-GPU baseline
bash scripts/run_benchmark.sh --model mixtral_8x7b --experiment single_gpu
# Multi-GPU with expert parallelism
bash scripts/run_benchmark.sh --model mixtral_8x7b --experiment multi_gpu --strategy ep_only
# With torch profiling
bash scripts/run_benchmark.sh --model mixtral_8x7b --experiment single_gpu --profile
# Dry run (shows commands without executing)
bash scripts/run_benchmark.sh --model mixtral_8x7b --experiment single_gpu --dry-run

# Run load balance simulation and analysis
python -m src.placement.load_balancing
# Generate load balance visualizations
python analysis/plot_load_balance.py

# Generate the full factorial design matrix
python -m src.benchmark.factorial_study

# AMD MI300X
docker build -f docker/Dockerfile.rocm -t moe-opt:rocm .
docker run --device /dev/kfd --device /dev/dri --group-add video \
-e HF_TOKEN=$HF_TOKEN -v $MODEL_DIR:/models:ro \
-it moe-opt:rocm
# NVIDIA
docker build -f docker/Dockerfile.cuda -t moe-opt:cuda .
docker run --gpus all -e HF_TOKEN=$HF_TOKEN -v $MODEL_DIR:/models:ro \
-it moe-opt:cuda

python analysis/plot_results.py # Throughput/latency plots
python analysis/plot_load_balance.py # EP load balance heatmaps
python analysis/generate_tables.py # LaTeX tables for paper

The experiments are organized into 7 phases, each building on the previous one. See experiment_starter.md for the complete end-to-end guide with exact commands, data capture tables, and a checklist.
| Phase | Goal | Est. Time |
|---|---|---|
| 1. Single-GPU Baselines | Per-model throughput, profiling traces | 2-3h |
| 2. Expert Routing Analysis | Activation patterns, Gini coefficients, co-activation | 1-2h |
| 3. Multi-GPU Placement | TP vs EP vs hybrid across 1-8 GPUs (factorial study) | 8-12h |
| 4. Expert-Aware Batching | Queue depth sweep [4-256] per model | 3-4h |
| 5. CPU Predictor Training | Train RandomForest on 50+ data points from Phases 1-4 | 1-2h |
| 6. Multi-Node Scaling | EP across 2-4 nodes with CX-7 RDMA (DeepSeek-V3, DBRX) | 4-8h |
| 7. Paper Analysis | Figures, ANOVA, LaTeX tables for SIEDS 2026 | 2-3h |
# Download models and run single-GPU baselines
bash scripts/download_models.sh single_gpu
bash scripts/run_benchmark.sh --model mixtral_8x7b --experiment single_gpu
bash scripts/run_llada_benchmarks.sh

Experiment 1 (Single-GPU Baseline): Sweep workloads (short/medium/long) x concurrency (1-64) for OLMoE, Qwen-MoE, Mixtral, LLaDA-MoE, and LLaDA-8B. Metrics: throughput, TTFT, ITL, GPU memory, CU utilization.
Experiment 2 (Expert Placement, Factorial Study): Full factorial design crossing Model x GPUs x Strategy (TP/EP/hybrid) x Queue depth x Workload. ANOVA determines which factors most affect throughput. Run python -m src.benchmark.factorial_study to generate the design matrix.
Experiment 3 (Expert-Aware Batching): Queue depth sweep [4, 8, 16, 32, 64, 128, 256] per model to find the best trade-off between throughput and memory use.
Experiment 4 (Multi-GPU Scaling): Scaling efficiency from 1 to 8 GPUs with AITer fused MoE kernels and hipBLASLt GEMM. Compare the allgather_reducescatter and pplx all-to-all backends.
Experiment 5 (Multi-Node Scaling): Cross-node EP with RDMA (2-4 nodes, CX-7 NICs) for DBRX (132B) and DeepSeek-V3 (671B).
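The full factorial design in Experiment 2 amounts to a Cartesian product over the factor levels. The sketch below is illustrative and is not the code in src/benchmark/factorial_study.py. The queue depths come from Experiment 3, and `mixtral_8x7b` and `ep_only` appear in the commands above. The other level names (`olmoe_1b_7b`, `tp_only`, `hybrid`) and the reduced model list are assumptions.

```python
from itertools import product

# Factor levels. Names not used elsewhere in this README are hypothetical.
FACTORS = {
    "model": ["olmoe_1b_7b", "mixtral_8x7b"],
    "num_gpus": [1, 2, 4, 8],
    "strategy": ["tp_only", "ep_only", "hybrid"],
    "queue_depth": [4, 8, 16, 32, 64, 128, 256],
    "workload": ["short", "medium", "long"],
}


def full_factorial(factors):
    """Return one dict per run, covering every combination of levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = full_factorial(FACTORS)
print(f"{len(design)} runs")
print(design[0])
```

Because every combination is run, the run count is the product of the level counts, so adding a level to any factor multiplies the total cost. That is worth checking against the time estimates in the phase table before launching.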
See Report1.md for results already collected.
The load balancing module (src/placement/load_balancing.py) provides:
- Imbalance metrics: Load imbalance ratio, coefficient of variation, Gini coefficient
- Hot/cold expert detection: Identifies over- and under-utilized experts
- Routing simulation: Uniform, Zipfian, and skewed distributions
- Rebalancing recommendations: Expert replication, migration, greedy re-mapping
- Visualization: GPU load bar charts, expert activation heatmaps
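The imbalance metrics and the greedy re-mapping listed above can be sketched in a few lines of standard-library Python. This is an independent illustration under simple assumptions, not the implementation in src/placement/load_balancing.py: routing is sampled from a Zipf-like popularity curve, and the greedy step ignores any per-GPU limit on the number of experts.

```python
import random
import statistics


def zipf_loads(num_experts, num_tokens, exponent=1.2, seed=0):
    """Sample per-expert token counts from a Zipf-like popularity curve."""
    rng = random.Random(seed)
    weights = [1.0 / rank ** exponent for rank in range(1, num_experts + 1)]
    loads = [0] * num_experts
    for expert in rng.choices(range(num_experts), weights=weights, k=num_tokens):
        loads[expert] += 1
    return loads


def imbalance_ratio(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / statistics.fmean(loads)


def coefficient_of_variation(loads):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(loads) / statistics.fmean(loads)


def gini(loads):
    """Gini coefficient: 0 for equal loads, (n-1)/n if one entry holds all."""
    xs = sorted(loads)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


def contiguous_placement(loads, num_gpus):
    """Baseline: experts 0..k-1 on GPU 0, the next k on GPU 1, and so on."""
    per_gpu = len(loads) // num_gpus
    return [sum(loads[g * per_gpu:(g + 1) * per_gpu]) for g in range(num_gpus)]


def greedy_placement(loads, num_gpus):
    """Heaviest-first greedy: assign each expert to the lightest GPU so far.

    No cap on experts per GPU is enforced (a simplifying assumption).
    """
    gpu_loads = [0] * num_gpus
    for load in sorted(loads, reverse=True):
        gpu_loads[gpu_loads.index(min(gpu_loads))] += load
    return gpu_loads


expert_loads = zipf_loads(num_experts=64, num_tokens=10_000)
before = contiguous_placement(expert_loads, num_gpus=8)
after = greedy_placement(expert_loads, num_gpus=8)
print("expert-level Gini:", round(gini(expert_loads), 3))
print("GPU imbalance, contiguous:", round(imbalance_ratio(before), 2))
print("GPU imbalance, greedy:", round(imbalance_ratio(after), 2))
```

Greedy re-mapping cannot push the ratio below what the single hottest expert imposes, which is the case where expert replication becomes the relevant recommendation.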
from placement.load_balancing import run_load_balance_study, print_load_balance_summary
from utils.config import get_model_config
model = get_model_config("mixtral_8x7b")
reports = run_load_balance_study(model, num_gpus=8)
print_load_balance_summary(reports["zipf"])

from placement.estimator import recommend_placement
from utils.config import get_model_config
model = get_model_config("mixtral_8x7b")
rec = recommend_placement(model, num_gpus=8, gpu_memory_gb=192.0)
print(f"Strategy: {rec.strategy_name}")
print(f"Memory/GPU: {rec.memory_per_gpu_gb:.1f} GB")
print(f"Queue depth: {rec.estimated_queue_depth}")

| Variable | Purpose |
|---|---|
| VLLM_ROCM_USE_AITER=1 | Enable AITer high-performance kernels |
| VLLM_ROCM_USE_AITER_MOE=1 | AITer fused MoE kernels (top-k routing, sorting) |
| HIP_FORCE_DEV_KERNARG=1 | Faster HIP kernel argument passing |
| TORCH_BLAS_PREFER_HIPBLASLT=1 | Prefer hipBLASLt for GEMM operations |
| NCCL_MIN_NCHANNELS=112 | Optimize multi-GPU NCCL collectives |
| VLLM_USE_TRITON_FLASH_ATTN=0 | Use CK-based FlashAttention (faster on MI300X) |
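One way to apply these variables when launching a server from Python is to merge them into the child process environment. The sketch below only builds the command and the environment; it does not start vLLM. The model path is a placeholder, and scripts/run_benchmark.sh may assemble its launch differently.

```python
import os
import shlex

# Values copied from the environment variable table above.
ROCM_TUNING_ENV = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_USE_AITER_MOE": "1",
    "HIP_FORCE_DEV_KERNARG": "1",
    "TORCH_BLAS_PREFER_HIPBLASLT": "1",
    "NCCL_MIN_NCHANNELS": "112",
    "VLLM_USE_TRITON_FLASH_ATTN": "0",
}


def build_launch(model_path, tensor_parallel_size, expert_parallel=False):
    """Return (command, environment) for a vLLM server launch.

    The tuning variables override any values already set in the shell.
    """
    env = {**os.environ, **ROCM_TUNING_ENV}
    cmd = ["vllm", "serve", model_path,
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if expert_parallel:
        cmd.append("--enable-expert-parallel")
    return cmd, env


# "/models/mixtral_8x7b" is a hypothetical path inside the container.
cmd, env = build_launch("/models/mixtral_8x7b", 8, expert_parallel=True)
print(shlex.join(cmd))
```

Passing `env` to `subprocess.Popen(cmd, env=env)` would then start the server with the tuning variables in place, without modifying the parent shell.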
| Tool | Platform | Command |
|---|---|---|
| Torch Profiler | Both | bash scripts/run_profiling.sh --model MODEL --torch |
| rocprofv3 | AMD | bash scripts/run_profiling.sh --model MODEL --rocprof |
| Nsight Systems | NVIDIA | bash scripts/run_profiling.sh --model MODEL --nsight |
View torch traces at Perfetto UI or chrome://tracing.
All configs in YAML under configs/. Environment variables (set at runtime, never hardcoded):
| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | (none) | HuggingFace authentication token |
| MODEL_DIR | ./models | Path to model weights |
| RESULTS_DIR | ./results | Path for benchmark output |
LLaDA diffusion models are not supported by vLLM due to their non-autoregressive masked diffusion architecture. We built a custom inference engine:
| Module | Purpose |
|---|---|
| src/inference/llada_engine.py | Single-GPU LLaDA inference with block-based denoising |
| src/inference/llada_distributed.py | Multi-GPU distributed inference with RCCL |
| src/inference/expert_parallel.py | Expert placement strategies + all-to-all dispatch |
| src/inference/profiler.py | Profiling for torch.profiler and rocprofv3 |
| docker/Dockerfile.llada | Docker image for LLaDA on ROCm |
| scripts/run_llada_benchmarks.sh | Automated sweep runner |
See Report1.md for full benchmark results.
This codebase implements the full methodology described in the SIEDS 2026 abstract:
| Abstract Claim | Implementation |
|---|---|
| Expert-aware batching with tunable queue depth | src/placement/estimator.py - estimate_queue_depth() |
| Static vs distributed expert placement | src/placement/strategies.py - 5 strategies |
| Placement estimation framework | src/placement/estimator.py - recommend_placement() |
| Lightweight CPU-based ML predictor | src/placement/predictor.py - RandomForest on CPU |
| Controlled factorial study | src/benchmark/factorial_study.py |
| rocprof + Nsight profiling | src/profiling/rocm_profiler.py, nvidia_profiler.py |
| Memory bandwidth + CU occupancy analysis | src/benchmark/metrics.py, gpu_info.py |
| Multi-GPU and multi-node scaling (up to 4 nodes) | configs/experiments/multi_node.yaml |
| EP load balance analysis | src/placement/load_balancing.py |
| Diffusion MoE inference (LLaDA) | src/inference/llada_engine.py, llada_distributed.py |
| Expert Parallelism with RCCL | src/inference/expert_parallel.py |
MIT License. See LICENSE.