This repo supports both vLLM and SGLang, but on a single-GPU host (for example, an AWS g5.2xlarge with one A10G), the recommended operating mode is:
- start one engine only,
- wait for health,
- run one or more scenarios,
- stop that engine,
- cool down if switching engines,
- start the other engine.
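The sequence above can be scripted so the steps always run in the same order. The following is a minimal POSIX `sh` sketch. It assumes the compose service names (`vllm`, `sglang`) and health ports (8000, 8001) used by the commands in this section. `COMPOSE`, `COOLDOWN_SECONDS`, `HEALTH_TIMEOUT`, `run_sequence`, and `run_scenarios` are hypothetical names, not part of this repo, and `run_scenarios` is only a placeholder for the real benchmark runner.

```sh
#!/bin/sh
# Sketch of the one-engine-at-a-time workflow on a single-GPU host.
# Hypothetical knobs (not defined by this repo): COMPOSE, COOLDOWN_SECONDS, HEALTH_TIMEOUT.
COMPOSE="${COMPOSE:-docker compose}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-300}"
HEALTH_TIMEOUT="${HEALTH_TIMEOUT:-600}"

# Map a compose service name to the port of its /health endpoint.
engine_port() {
  case "$1" in
    vllm) echo 8000 ;;
    sglang) echo 8001 ;;
    *) echo "unknown engine: $1" >&2; return 1 ;;
  esac
}

# Poll the health URL until curl -f succeeds (no HTTP error) or HEALTH_TIMEOUT expires.
wait_for_health() {
  deadline=$(( $(date +%s) + HEALTH_TIMEOUT ))
  until curl -fsS --max-time 5 "$1" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $1" >&2
      return 1
    fi
    sleep 5
  done
}

# Hypothetical placeholder: replace with the repo's real scenario runner.
run_scenarios() {
  echo "running scenarios against $1"
}

# Start one engine, wait for health, run scenarios, then stop it.
run_engine_session() {
  port="$(engine_port "$1")" || return 1
  $COMPOSE up -d dashboard "$1" || return 1
  if ! wait_for_health "http://localhost:${port}/health"; then
    $COMPOSE stop "$1"   # don't leave a half-started engine holding VRAM
    return 1
  fi
  run_scenarios "$1"
  $COMPOSE stop "$1"     # release the GPU before the next engine starts
}

# Run each engine in turn, sleeping COOLDOWN_SECONDS between engine switches.
run_sequence() {
  first=1
  for engine in "$@"; do
    [ "$first" -eq 1 ] || sleep "$COOLDOWN_SECONDS"
    first=0
    run_engine_session "$engine" || return 1
  done
}
```

For example, sourcing the file and calling `run_sequence vllm sglang` would benchmark vLLM, stop it, wait `COOLDOWN_SECONDS`, then benchmark SGLang. Polling `/health` in a loop, rather than issuing a single `curl`, covers the model-loading period during which the endpoint is not yet answering.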
Running both engines at the same time on a single A10G can cause:
- VRAM contention,
- misleading latency / throughput results,
- unstable startup behavior,
- cache allocation failures,
- and unfair comparisons.
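One way to avoid starting both engines by accident is a preflight check before each `docker compose up`. The sketch below is hypothetical: `gpu_free_for`, `running_services`, and the `COMPOSE` override are assumed names, and it assumes the compose services are named `vllm` and `sglang`. It relies on the Compose v2 `ps --services --filter status=running` listing.

```sh
#!/bin/sh
# Hypothetical preflight guard: refuse to start one engine while the other is still running.
COMPOSE="${COMPOSE:-docker compose}"

# List compose services that are currently running, one per line.
running_services() {
  $COMPOSE ps --services --filter status=running
}

# Return 0 only if no engine other than "$1" is running.
gpu_free_for() {
  services="$(running_services)" || { echo "could not query compose services" >&2; return 1; }
  for svc in $services; do
    case "$svc" in
      vllm|sglang)
        if [ "$svc" != "$1" ]; then
          echo "refusing to start $1: $svc is still running (stop it first)" >&2
          return 1
        fi
        ;;
    esac
  done
  return 0
}
```

Usage might look like `gpu_free_for sglang && docker compose up -d dashboard sglang`. The dashboard is deliberately ignored, so it can stay up while engines are swapped.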
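Start vLLM and check its health endpoint:

```sh
docker compose up -d dashboard vllm
curl http://localhost:8000/health
```

Start SGLang and check its health endpoint:

```sh
docker compose up -d dashboard sglang
curl http://localhost:8001/health
```

Stop the engines, optionally cooling down before switching:

```sh
docker compose stop vllm sglang
sleep 300  # optional cooldown between engines for long benchmark sessions
```

The dashboard is intentionally decoupled from engine startup, so it can run while: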
- only vLLM is running,
- only SGLang is running,
- or both are absent.
This means the dashboard can stay up while engines are stopped and started, which makes sequential benchmark workflows safer.
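Because an absent engine is a normal state in this workflow, a status check should report it rather than fail. The following is a hypothetical POSIX `sh` sketch of such a probe, not the repo's dashboard code. `probe_engine`, `engine_status`, and `BASE_HOST` are assumed names, and the ports match the health checks in this section.

```sh
#!/bin/sh
# Hypothetical status probe: an engine that is not running is reported as "down", not treated as an error.
BASE_HOST="${BASE_HOST:-localhost}"

# Print "<name>: up" if the engine's /health endpoint answers without an HTTP error, else "<name>: down".
probe_engine() {
  if curl -fsS --max-time 2 "http://${BASE_HOST}:$2/health" >/dev/null 2>&1; then
    echo "$1: up"
  else
    echo "$1: down"
  fi
}

# Report both engines; works whether one, the other, or neither is running.
engine_status() {
  probe_engine vllm 8000
  probe_engine sglang 8001
}
```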