SageMaker LLM Inference Optimizer

Python 3.11+ | License: MIT

A production-grade benchmarking framework that deploys and evaluates LLM inference backends on Amazon SageMaker to find the optimal serving configuration by latency, throughput, and cost. It automates the full lifecycle — deploy, load test, collect metrics, visualize, and recommend — so you can make data-driven decisions about how to serve your models in production.


Results

Browse the full interactive report: results/audited-2026-04-07-current/benchmark_report.html

Raw metrics & charts: results/audited-2026-04-07-current/

Detailed write-ups: reports/latest-benchmark-report.md | reports/public-vllm-validated-2026-04-06.md

14 configurations benchmarked across 3 backends (vLLM, TensorRT-LLM, llama.cpp), 5 model variants, GPU and CPU instance types. 54 metric rows collected, 50 with finite cost.

Best Observed Points

Metric Best observed point Notes
Lowest cost/token overall vllm-llama32-1b-fp16-g5 @ c=25: $0.37/M tokens 1B GPU rerun; not directly comparable to 7B-class workloads
Highest throughput overall vllm-llama32-1b-fp16-g5 @ c=25: 1057.9 tok/s Fastest audited point in the current bundle
Lowest overall P95 E2E latency vllm-tinyllama-1.1b-fp16-g5-test @ c=1: 579.7 ms Smoke config only; not apples-to-apples with the full sweeps
Lowest cost/token among 7B-class configs trtllm-mistral7b-awq-int4-g5 @ c=10: $0.86/M tokens Best 7B-class cost point in a full 4-level sweep
Best comparable FP16 7B point vllm-mistral7b-fp16-g5 @ c=10: 192.3 tok/s, 9077.7 ms P95, $2.04/M Best completed FP16 7B point on the lower-cost GPU instance
Best CPU low-error point llamacpp-llama32-1b-q4km-m54 @ c=1: 72.5 tok/s, 5844.5 ms P95, $3.53/M c=1,5,10 were clean; c=25 degraded badly
Best Mistral 7B CPU point llamacpp-mistral-q4km-m54 @ c=1: 11.7 tok/s, 31767.9 ms P95, $21.79/M ml.m5.xlarge failed warmup entirely; m5.2xlarge / m5.4xlarge collapse above c=1

Best Successful / Most Useful Point Per Audited Config

Config Best conc. Throughput (tok/s) E2E P95 (ms) $/M tokens Notes
vllm-llama32-1b-fp16-g5 25 1057.86 2024.30 0.37 Best overall throughput + cost in the current audited bundle
trtllm-mistral7b-awq-int4-g5 10 493.13 3403.34 0.86 Best overall 7B-class cost + throughput
vllm-mistral7b-gptq-int4-g5 25 436.67 3481.04 0.90 GPTQ-INT4 on DJL-LMI 0.36.0
vllm-llama31-8b-fp16-g5 25 225.05 9642.77 1.74 Gated model (requires HF_TOKEN)
trtllm-mistral7b-fp16-g5 10 191.99 8628.98 2.20 Precompiled engine on ml.g5.2xlarge
vllm-mistral7b-fp16-g5 10 192.27 9077.69 2.04 Best completed FP16 result on ml.g5.xlarge
trtllm-mistral7b-fp16-g5-4xlarge 10 190.55 8748.68 2.96 Similar throughput to vLLM FP16, but on a pricier instance
vllm-tinyllama-1.1b-fp16-g5-test 2 173.19 1705.46 2.26 Smoke config only: c=[1,2], 10 measured requests, max_output_tokens=64
vllm-qwen25-7b-awq-int4-g5 10 114.10 19012.41 3.43 Stable but still weak value among the 7B GPU sweeps
llamacpp-llama32-1b-q4km-m54 1 72.47 5844.47 3.53 CPU run was clean through c=10, then degraded hard at c=25
llamacpp-qwen15b-q4km-m54 5 60.29 33386.76 4.24 c=1,5,10 were clean; c=25 had 82% errors
llamacpp-tinyllama-q4km-m54 1 52.66 6138.54 4.85 c=1,5,10 were clean; c=25 had 15% errors
llamacpp-mistral-q4km-m54 1 11.73 31767.88 21.79 Only c=1 was healthy; c=5 had 98% errors, c=10,25 fully collapsed
llamacpp-mistral-q4km-m52 1 6.32 57593.96 20.22 Only c=1 was usable; c=5 had 99% errors, c=10,25 fully collapsed

Comparison Caveats

  • The smoke config is not a fair apples-to-apples comparison. vllm-tinyllama-1.1b-fp16-g5-test intentionally uses c=[1,2], only 10 measured requests, and max_output_tokens=64. Treat it as a fast health-check datapoint.
  • Most completed full sweeps use sharegpt_subset, c=[1,5,10,25], and 100 measured requests per level.
  • Cross-backend and cross-instance comparisons are directional, not perfectly controlled. ml.g5.xlarge, ml.g5.2xlarge, ml.g5.4xlarge, and ml.m5.* are solving different cost/latency trade-offs.
  • llama.cpp TTFT is currently instrumentation-limited. CPU rows currently record ttft_p95=0.0, so TTFT-specific and radar-style cross-backend comparisons are not trustworthy yet.
  • Several CPU runs only stay healthy at low or medium concurrency. Do not treat their c=25 rows as production-ready just because a metric file exists.
  • If you want the cleanest public-model apples-to-apples slice, use reports/public-vllm-validated-2026-04-06.md.
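The caveats above amount to screening rows before comparing them. A minimal sketch of that screening, assuming each metric row is a dict with hypothetical `config`, `concurrency`, and `error_rate` keys (only `ttft_p95` is a field name confirmed by this README) and an arbitrary 1% error threshold:

```python
from typing import Any

MAX_ERROR_RATE = 0.01  # assumed threshold, not a project default


def is_healthy(row: dict[str, Any], max_error_rate: float = MAX_ERROR_RATE) -> bool:
    """A row is comparable only if its error rate stays under the threshold."""
    return row["error_rate"] <= max_error_rate


def has_usable_ttft(row: dict[str, Any]) -> bool:
    """ttft_p95 == 0.0 signals missing instrumentation, not an instant first token."""
    return row["ttft_p95"] > 0.0


# Error rates taken from the tables in this README; key names are hypothetical.
rows = [
    {"config": "llamacpp-tinyllama-q4km-m54", "concurrency": 10, "error_rate": 0.00, "ttft_p95": 0.0},
    {"config": "llamacpp-tinyllama-q4km-m54", "concurrency": 25, "error_rate": 0.15, "ttft_p95": 0.0},
    {"config": "llamacpp-mistral-q4km-m54", "concurrency": 5, "error_rate": 0.98, "ttft_p95": 0.0},
]

healthy = [r for r in rows if is_healthy(r)]
ttft_ready = [r for r in healthy if has_usable_ttft(r)]
print(len(healthy), len(ttft_ready))
```

Only the c=10 row passes the error screen, and none of the CPU rows qualify for TTFT comparisons.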

How It Works

This tool answers a simple question: "What's the best way to serve my LLM in production?"

It works in four stages:

  1. Deploy — For each configuration (backend + quantization + instance type), the framework provisions a SageMaker real-time endpoint using either a managed DJL container (vLLM / TensorRT-LLM) or a custom BYOC container (llama.cpp). Endpoints are created inside a Python context manager that guarantees teardown even on crashes or interrupts.

  2. Benchmark — Once the endpoint is healthy, the load generator fires async HTTP requests at increasing concurrency levels (e.g., 1, 5, 10, 25 concurrent users). Each request captures timestamps for time-to-first-token (TTFT), inter-token latency (ITL), and end-to-end latency. A brief warmup phase stabilizes JIT compilation and KV caches before measured requests begin.

  3. Analyze — Raw per-request traces are aggregated into percentile metrics (P50/P95/P99), throughput (tokens/sec), cost per million tokens, and Model Bandwidth Utilization (MBU). A Pareto frontier algorithm identifies the configurations that offer the best trade-offs between cost and latency — no other config is both cheaper AND faster.

  4. Visualize — Eight interactive Plotly charts are generated (Pareto scatter, latency distributions, throughput scaling, cost bars, MBU bars, radar comparison, throughput heatmap, cost-performance bubble). A Streamlit dashboard lets you explore results interactively, and an optional MLflow integration logs all metrics for experiment tracking.
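The Pareto step in stage 3 can be illustrated with a small two-objective filter. This is a sketch of the general technique, not the code in src/analysis/pareto.py; `Point` and `pareto_frontier` are hypothetical names, and the sample values are copied from the per-config results table purely to show the mechanics (they mix model sizes, so the result is not a serving recommendation).

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost_per_m_tokens: float  # lower is better
    p95_latency_ms: float     # lower is better


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = (a.cost_per_m_tokens <= b.cost_per_m_tokens
                and a.p95_latency_ms <= b.p95_latency_ms)
    strictly_better = (a.cost_per_m_tokens < b.cost_per_m_tokens
                       or a.p95_latency_ms < b.p95_latency_ms)
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates (O(n^2), fine for small n)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    Point("vllm-llama32-1b-fp16-g5", 0.37, 2024.30),
    Point("trtllm-mistral7b-awq-int4-g5", 0.86, 3403.34),
    Point("vllm-tinyllama-1.1b-fp16-g5-test", 2.26, 1705.46),
    Point("vllm-mistral7b-fp16-g5", 2.04, 9077.69),
]
frontier = pareto_frontier(points)
print([p.name for p in frontier])
```

In this sample the 1B vLLM point and the smoke config survive: the first is cheapest, the second has the lowest P95, and each of the other two is beaten on both axes by the 1B point.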

The entire pipeline is config-driven: each YAML file in configs/ defines a complete experiment (model, backend, instance, benchmark parameters). Add a new YAML, run make benchmark-single, and the framework handles the rest.
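One way to picture the config-driven design is a recursive merge of an experiment file over shared defaults. The sketch below works on plain dicts as they might look after YAML parsing. Whether the project merges deeply or shallowly is an assumption, `deep_merge` is a hypothetical helper, and the `benchmark.*` key names are illustrative (the `endpoint.*` keys match the TRT-LLM example in this README).

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for configs/base.yaml (key names under "benchmark" are assumed).
base = {
    "benchmark": {"concurrency_levels": [1, 5, 10, 25], "num_requests": 100},
    "endpoint": {"instance_type": "ml.g5.xlarge", "instance_cost_per_hour": 1.41},
}
# Stand-in for a smoke-test experiment file that overrides only what differs.
override = {
    "name": "vllm-tinyllama-1.1b-fp16-g5-test",
    "benchmark": {"concurrency_levels": [1, 2], "num_requests": 10},
}

config = deep_merge(base, override)
print(config["benchmark"], config["endpoint"]["instance_type"])
```

The override replaces the sweep parameters while the endpoint settings fall through from the defaults, and `base` itself is left unmodified.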


Why This Project

Choosing the right LLM serving stack involves hard trade-offs:

  • vLLM gives great throughput but requires GPU instances
  • TensorRT-LLM can be faster after compilation but has cold-start overhead
  • INT4 quantization cuts cost but may affect output quality
  • llama.cpp on CPU has the lowest hourly instance price but degrades at high concurrency, which can push its cost per token above the GPU options

This tool runs controlled experiments across all of these and gives you the numbers to decide. No guessing.


Backends & Configurations

All results live in results/audited-2026-04-07-current/. Historical reruns are archived under results/archive-2026-04-07/.

GPU Backends

Config / run Backend Model Quantization Instance GPU $/hr Final status
vllm_llama32_1b_fp16 vLLM via DJL-LMI Llama-3.2-1B-Instruct FP16 ml.g5.xlarge A10G 24 GB $1.41 Complete — best c=25, 1057.86 tok/s, $0.37/M
vllm_llama31_8b_fp16 vLLM via DJL-LMI Llama-3.1-8B-Instruct FP16 ml.g5.xlarge A10G 24 GB $1.41 Complete — best c=25, 225.05 tok/s, $1.74/M
vllm_fp16 vLLM via DJL-LMI Mistral-7B-Instruct-v0.3 FP16 ml.g5.xlarge A10G 24 GB $1.41 Complete — best c=10, 192.27 tok/s, $2.04/M
vllm_awq_int4 vLLM via DJL-LMI Qwen2.5-7B-Instruct-AWQ AWQ-INT4 ml.g5.xlarge A10G 24 GB $1.41 Complete — best c=10, 114.10 tok/s, $3.43/M
test_single vLLM via DJL-LMI TinyLlama-1.1B-Chat-v1.0 FP16 ml.g5.xlarge A10G 24 GB $1.41 Smoke config: c=[1,2] only, best c=2, 173.19 tok/s, $2.26/M
trtllm_fp16 TensorRT-LLM via DJL Mistral-7B-Instruct-v0.3 FP16 ml.g5.2xlarge A10G 24 GB $1.52 Complete (precompiled engine) — best c=10, 191.99 tok/s, $2.20/M
trtllm_fp16_g54xlarge TensorRT-LLM via DJL Mistral-7B-Instruct-v0.3 FP16 ml.g5.4xlarge A10G 24 GB $2.03 Complete — best c=10, 190.55 tok/s, $2.96/M
trtllm_awq_int4 TensorRT-LLM via DJL Mistral-7B-Instruct-v0.3 AWQ-INT4 ml.g5.2xlarge A10G 24 GB $1.52 Complete — best c=10, 493.13 tok/s, $0.86/M
vllm_gptq_int4 vLLM via DJL-LMI Mistral-7B-GPTQ GPTQ-INT4 ml.g5.xlarge A10G 24 GB $1.41 Complete — best c=25, 436.67 tok/s, $0.90/M

CPU Backends

Config / run Backend Model Quantization Instance Compute $/hr Final status
llamacpp-mistral-q4km-m5 llama.cpp (BYOC) Mistral-7B GGUF Q4_K_M GGUF Q4_K_M ml.m5.xlarge CPU only $0.23 Failed — endpoint launched, but warmup failed 10/10 requests; excluded from the canonical metric bundle
llamacpp-mistral-q4km-m52 llama.cpp (BYOC) Mistral-7B GGUF Q4_K_M GGUF Q4_K_M ml.m5.2xlarge CPU only $0.46 Executed, limited viability: c=1 usable (6.32 tok/s, 57593.96 ms P95, $20.22/M); c=5 had 99% errors; c=10,25 had no successful requests
llamacpp-mistral-q4km-m54 llama.cpp (BYOC) Mistral-7B GGUF Q4_K_M GGUF Q4_K_M ml.m5.4xlarge CPU only $0.92 Executed, limited viability: c=1 usable (11.73 tok/s, 31767.88 ms P95, $21.79/M); c=5 had 98% errors; c=10,25 had no successful requests
llamacpp_tinyllama_11b_q4km_m54 llama.cpp (BYOC) TinyLlama-1.1B-Chat-v1.0 GGUF GGUF Q4_K_M ml.m5.4xlarge CPU only $0.92 Executed, low/mid concurrency clean: c=1,5,10 were error-free; best point c=1, 52.66 tok/s, $4.85/M; c=25 had 15% errors
llamacpp_llama32_1b_q4km_m54 llama.cpp (BYOC) Llama-3.2-1B-Instruct GGUF GGUF Q4_K_M ml.m5.4xlarge CPU only $0.92 Executed, low/mid concurrency clean: c=1,5,10 were error-free; best point c=1, 72.47 tok/s, $3.53/M; c=25 had 76% errors
llamacpp_qwen25_15b_q4km_m54 llama.cpp (BYOC) Qwen2.5-1.5B-Instruct GGUF GGUF Q4_K_M ml.m5.4xlarge CPU only $0.92 Executed, low/mid concurrency clean: c=1,5,10 were error-free; best point c=5, 60.29 tok/s, $4.24/M; c=25 had 82% errors

Comparison notes:

  • CPU and GPU results are useful for different operating points and are not directly comparable.
  • Most full configs use sharegpt_subset, c=[1,5,10,25], and 100 measured requests.
  • The smoke config does not share that sweep. test_single uses c=[1,2], 10 measured requests, and max_output_tokens=64, so treat it as a low-cost health check rather than a strict peer in the comparison set.
  • llama.cpp TTFT is currently missing from the metric stream. CPU rows currently show ttft_p95=0.0, so TTFT-specific charts and radar comparisons should be read with caution.

Metrics Collected

Each benchmark run captures per-request traces and aggregates them into:

Metric What it measures Why it matters
TTFT (P50/P95/P99) Time to first token User-perceived responsiveness
ITL (P50/P95/P99) Inter-token latency Smoothness of streaming output
E2E Latency (P50/P95/P99) Full request latency SLA compliance
Throughput Tokens/sec, Requests/sec Capacity planning
Cost $/million output tokens Budget optimization
MBU Model Bandwidth Utilization Hardware efficiency (are you wasting GPU memory bandwidth?)
Error Rate Failed / total requests Reliability under load
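The percentile and cost rows can be made concrete with a short sketch. The cost formula below (hourly instance price divided by tokens generated per hour, scaled to one million tokens) reproduces the published $/M figures from the published throughput and $/hr values, but treating it as the project's exact implementation is an assumption. `percentile` and `cost_per_million_tokens` are hypothetical names.

```python
def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolated percentile for pct in [0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def cost_per_million_tokens(instance_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000


# Synthetic latencies in milliseconds, only to exercise the percentile helper.
latencies_ms = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0]
print(percentile(latencies_ms, 50))

# Throughput and $/hr taken from the results tables in this README.
print(round(cost_per_million_tokens(1.41, 1057.86), 2))  # vllm-llama32-1b-fp16-g5
print(round(cost_per_million_tokens(1.52, 493.13), 2))   # trtllm-mistral7b-awq-int4-g5
print(round(cost_per_million_tokens(0.92, 11.73), 2))    # llamacpp-mistral-q4km-m54
```

The three cost calls round to 0.37, 0.86, and 21.79, matching the table rows for those configs. This also shows why a low hourly price does not imply a low cost per token: throughput sits in the denominator.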

Architecture

                        ┌──────────────────────────────────────────────┐
                        │              YAML Configs                     │
                        │  (backend, quantization, instance, params)    │
                        └──────────────────┬───────────────────────────┘
                                           │
                                           ▼
┌──────────────────┐    ┌──────────────────────────────────┐    ┌──────────────────┐
│                  │    │        Endpoint Manager           │    │                  │
│   Config Loader  │───▶│  Deploy ──▶ Health Check ──▶ Yield │───▶│   SageMaker      │
│   (Pydantic +    │    │  (context manager with finally    │    │   Endpoint       │
│    YAML merge)   │    │   guarantees teardown)            │    │   (InService)    │
│                  │    └──────────────────────────────────┘    └────────┬─────────┘
└──────────────────┘                                                    │
                                                                        ▼
┌──────────────────┐    ┌──────────────────────────────────┐    ┌──────────────────┐
│                  │    │       Metrics Collector           │    │                  │
│  Pareto Analysis │◀───│  TTFT, ITL, E2E percentiles      │◀───│  Load Generator  │
│  + Plotly Charts │    │  Throughput, Cost, MBU            │    │  (async, bounded │
│  + Streamlit     │    │                                  │    │   concurrency)   │
│                  │    └──────────────────────────────────┘    └──────────────────┘
└───────┬──────────┘
        │
        ▼
┌──────────────────┐    ┌──────────────────────────────────┐
│  results/        │    │  MLflow Tracking (optional)       │
│  - JSON metrics  │    │  SageMaker Model Registry         │
│  - HTML charts   │    │  (registers Pareto-optimal        │
│                  │    │   configs for prod deployment)    │
└──────────────────┘    └──────────────────────────────────┘
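The load generator box in the diagram (async, bounded concurrency) can be sketched with an `asyncio.Semaphore`. This is an illustration of the pattern, not the project's load_generator.py: `run_level` and `fake_request` are hypothetical names, and `fake_request` stands in for a real endpoint invocation.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_level(
    send_request: Callable[[int], Awaitable[None]],
    num_requests: int,
    concurrency: int,
) -> list[float]:
    """Issue num_requests calls with at most `concurrency` in flight.

    Returns per-request latencies in seconds, in completion order.
    """
    semaphore = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one(i: int) -> None:
        async with semaphore:  # blocks while `concurrency` requests are active
            start = time.perf_counter()
            await send_request(i)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one(i) for i in range(num_requests)))
    return latencies


# Bookkeeping so the demo can show the bound is respected.
state = {"in_flight": 0, "peak": 0}


async def fake_request(i: int) -> None:
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # simulated network + inference time
    state["in_flight"] -= 1


latencies = asyncio.run(run_level(fake_request, num_requests=20, concurrency=5))
print(len(latencies), state["peak"])
```

A semaphore keeps a fixed number of requests outstanding, which models "N concurrent users" more directly than firing all requests at once and relying on the server to queue them.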

Key design decisions:

  • Guaranteed cleanup: EndpointManager.managed_endpoint() uses a finally block + SIGTERM handler so endpoints are torn down on normal exit, exceptions, and SIGTERM
  • Cost safety — benchmarks exceeding $10 estimated cost are blocked unless --allow-high-cost is passed
  • Structured logging — JSON-formatted logs with correlation IDs via structlog for production traceability
  • Security hardened — IAM scoped to llm-bench-* resources, KMS encryption on S3, bandit + pip-audit in CI
  • Config inheritance — each YAML overrides configs/base.yaml, keeping configs DRY
  • Deterministic prompts — seeded RNG ensures identical inputs across runs for reproducibility
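The cleanup guarantee in the first bullet follows a common pattern: a context manager whose `finally` block deletes the endpoint, plus a SIGTERM handler that raises so the `finally` block still runs. The sketch below shows that pattern with injected `create` and `delete` callables. Its signature is an assumption and does not mirror the real `EndpointManager.managed_endpoint()`.

```python
import signal
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def managed_endpoint(
    name: str,
    create: Callable[[str], None],
    delete: Callable[[str], None],
) -> Iterator[str]:
    """Create an endpoint, yield its name, and always attempt deletion."""

    def on_sigterm(signum, frame):
        # Convert SIGTERM into an exception so the finally block below executes.
        raise SystemExit(128 + signum)

    # signal.signal must be called from the main thread.
    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        create(name)
        yield name
    finally:
        try:
            delete(name)
        finally:
            # Restore whatever handler was installed before we started.
            if previous is not None:
                signal.signal(signal.SIGTERM, previous)


events: list[str] = []
try:
    with managed_endpoint(
        "llm-bench-demo",
        create=lambda n: events.append(f"create {n}"),
        delete=lambda n: events.append(f"delete {n}"),
    ):
        raise RuntimeError("benchmark crashed")
except RuntimeError:
    pass
print(events)
```

The delete call runs even though the body raised. SIGKILL (for example from the kernel OOM killer) cannot be caught by any handler, which is why a separate sweep such as `make clean` is still useful.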

Project Structure

sagemaker-llm-inference-optimizer/
│
├── src/
│   ├── config.py                  # Pydantic config system (YAML + .env)
│   ├── logging_config.py          # Structured logging setup (structlog + JSON)
│   ├── deploy/
│   │   ├── base.py               # Abstract deployer interface + shared helpers
│   │   ├── vllm_deployer.py      # vLLM on SageMaker LMI (DJL container)
│   │   ├── trtllm_deployer.py    # TensorRT-LLM (extends vLLM deployer)
│   │   ├── llamacpp_deployer.py  # Custom BYOC container on CPU
│   │   ├── endpoint_manager.py   # Lifecycle manager + cleanup guarantees
│   │   └── container_utils.py    # Image URI resolution helpers
│   ├── benchmark/
│   │   ├── runner.py             # CLI entry-point + orchestration
│   │   ├── load_generator.py     # Async concurrent request engine
│   │   ├── metrics_collector.py  # Percentile aggregation + cost calc
│   │   └── prompts.py            # Prompt datasets (ShareGPT-like, synthetic)
│   ├── analysis/
│   │   ├── pareto.py             # 2D + N-dimensional Pareto frontier
│   │   ├── cost_calculator.py    # $/token projections
│   │   ├── mbu.py                # Model bandwidth utilization estimation
│   │   └── visualizations.py     # 8 Plotly chart generators + HTML report
│   ├── tracking/
│   │   ├── mlflow_logger.py      # MLflow integration (optional)
│   │   └── model_registry.py     # SageMaker Model Registry
│   └── dashboard/
│       └── app.py                # Streamlit interactive dashboard
│
├── configs/                       # Benchmark YAML configurations
│   ├── base.yaml                 # Shared defaults (model, benchmark params)
│   ├── vllm_fp16.yaml
│   ├── vllm_awq_int4.yaml
│   ├── vllm_gptq_int4.yaml
│   ├── vllm_llama32_1b_fp16.yaml
│   ├── vllm_llama31_8b_fp16.yaml
│   ├── trtllm_fp16.yaml
│   ├── trtllm_fp16_g54xlarge.yaml
│   ├── trtllm_awq_int4.yaml
│   ├── llamacpp_gguf_q4km.yaml
│   ├── llamacpp_llama32_1b_q4km_m54.yaml
│   ├── llamacpp_qwen25_15b_q4km_m54.yaml
│   ├── llamacpp_tinyllama_11b_q4km_m54.yaml
│   ├── test_single.yaml          # Minimal smoke-test config (fast, low cost)
│   ├── environments/             # Environment-specific overrides (dev/staging/prod)
│   ├── full_gpu/                 # Preset sweep: vLLM + TRT-LLM on ml.g5
│   │   ├── base.yaml
│   │   └── *.yaml
│   └── vllm_only/                # Preset sweep: vLLM-only subset
│       ├── base.yaml
│       └── *.yaml
│
├── docs/
│   ├── images/                   # Chart PNGs from benchmark runs
│   ├── EXPLAINED.md              # Deep-dive: design decisions and trade-offs
│   ├── RUNBOOK.md                # Production operations runbook
│   └── trtllm-precompiled-engine.md
│
├── tests/                         # Pytest suite (70 tests, moto-mocked AWS)
├── infrastructure/terraform/      # IaC for S3, ECR, IAM, CloudWatch alarms
├── scripts/                       # AWS bootstrap + container build helpers
├── reports/                       # Benchmark report with full result tables
├── environment.yml                # Conda environment definition
├── Makefile                       # All commands (conda-based)
├── Dockerfile.llamacpp            # Multi-stage build for llama.cpp BYOC
└── .github/workflows/             # CI (lint + type check + test) + benchmark dispatch

Quick Start

1. Create the conda environment

make env
conda activate sagemaker-llm-optimizer

2. Run lint and tests (no AWS needed)

make lint
make test

3. Configure AWS

cp .env.example .env

Edit .env with your values:

AWS_DEFAULT_REGION=us-east-1
SAGEMAKER_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerBenchmarkRole
SAGEMAKER_S3_BUCKET=my-benchmark-bucket
ECR_REPO_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/llama-cpp-server  # only for llama.cpp
HF_TOKEN=hf_...  # only for gated models (Llama)

4. Bootstrap AWS resources

make setup-aws           # Creates S3 bucket, ECR repo, IAM role
make deploy-infra        # (Optional) Terraform for CloudWatch alarms

5. Validate configs (dry run, no AWS calls)

make dry-run

Output:

Validated 13 configs
- llamacpp-mistral7b-gguf-q4km-m5: backend=llamacpp, instance=ml.m5.xlarge, $/hr=0.23, concurrency=[1, 5, 10, 25]
- llamacpp-llama32-1b-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- llamacpp-qwen15b-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- llamacpp-tinyllama-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- vllm-tinyllama-1.1b-fp16-g5-test: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 2]
- trtllm-mistral7b-awq-int4-g5: backend=trtllm, instance=ml.g5.2xlarge, $/hr=1.52, concurrency=[1, 5, 10, 25]
- trtllm-mistral7b-fp16-g5: backend=trtllm, instance=ml.g5.2xlarge, $/hr=1.52, concurrency=[1, 5, 10, 25]
- trtllm-mistral7b-fp16-g5-4xlarge: backend=trtllm, instance=ml.g5.4xlarge, $/hr=2.03, concurrency=[1, 5, 10, 25]
- vllm-qwen25-7b-awq-int4-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-mistral7b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-mistral7b-gptq-int4-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-llama31-8b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-llama32-1b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]

6. Run benchmarks

# Single config
make benchmark-single CONFIG=configs/vllm_fp16.yaml

# All configs in the top-level configs/ directory
make benchmark

# Preset sweep — vLLM only (3 configs)
python -m src.benchmark.runner --config-dir configs/vllm_only

# Preset sweep — full GPU (vLLM + TRT-LLM, 5 configs)
python -m src.benchmark.runner --config-dir configs/full_gpu

# Scan all config subdirectories in one pass
python -m src.benchmark.runner --config-dir configs/ --recursive

7. Explore results

make dashboard           # Streamlit at http://localhost:8501
make sample-report       # Generate synthetic demo data (no AWS needed)

Command Reference

Command Description
make env Create the conda environment from environment.yml
make env-update Update an existing conda environment
make lint Lint (ruff) + type check (mypy)
make test Run the full pytest suite
make dry-run Validate all configs, print cost estimates (no AWS calls)
make benchmark Run all configs in configs/ end-to-end
make benchmark-single CONFIG=<path> Run a single config
make dashboard Launch Streamlit results dashboard
make sample-report Generate synthetic demo data for the dashboard
make clean Delete ALL llm-bench-* SageMaker endpoints (safety sweep)
make build-llamacpp Build + push the llama.cpp BYOC container to ECR
make setup-aws Bootstrap S3, ECR, IAM via helper script
make deploy-infra Provision Terraform resources (monitoring, alarms)
make destroy-infra Tear down Terraform resources

CLI flags

python -m src.benchmark.runner [OPTIONS]

  --config PATH                Single config YAML to run
  --config-dir PATH            Directory of config YAMLs  [default: configs/]
  --recursive                  Also scan config subdirectories (full_gpu/, vllm_only/)
  --instance-type TEXT         Override endpoint instance type
  --instance-cost-per-hour FLOAT  Override hourly cost
  --model-s3-uri TEXT          Override MODEL_S3_URI (llama.cpp/GGUF)
  --model-data-url TEXT        Override model_data_url (TRT-LLM precompiled)
  --dry-run                    Validate configs only, no AWS calls
  --allow-high-cost            Allow benchmarks with estimated cost > $10
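The `--allow-high-cost` gate can be pictured as a pre-flight check. A minimal sketch, assuming the estimate is simply hourly price times expected endpoint uptime (the README does not state how the project computes its estimate); `estimate_cost`, `check_cost`, and `HighCostError` are hypothetical names:

```python
COST_LIMIT_USD = 10.0  # threshold stated in this README


class HighCostError(RuntimeError):
    """Raised when the estimated cost exceeds the limit without an override."""


def estimate_cost(instance_cost_per_hour: float, estimated_hours: float) -> float:
    # Assumed model: the endpoint bills for every hour it stays up.
    return instance_cost_per_hour * estimated_hours


def check_cost(estimated_cost: float, allow_high_cost: bool = False) -> None:
    """Refuse to proceed above the limit unless the caller opted in."""
    if estimated_cost > COST_LIMIT_USD and not allow_high_cost:
        raise HighCostError(
            f"estimated ${estimated_cost:.2f} exceeds ${COST_LIMIT_USD:.2f}; "
            "pass --allow-high-cost to proceed"
        )


# ml.g5.4xlarge at $2.03/hr for a hypothetical 6-hour run.
estimate = estimate_cost(2.03, 6)
try:
    check_cost(estimate)
    blocked = False
except HighCostError:
    blocked = True

check_cost(estimate, allow_high_cost=True)  # explicit opt-in passes
print(round(estimate, 2), blocked)
```

Running the check before any endpoint is created means a misconfigured sweep fails fast instead of billing for hours.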

Pre-compiled TRT-LLM Engine

If you already have a compiled TRT-LLM engine in S3, deploy it directly via model_data_url instead of compiling from Hugging Face at endpoint startup (which takes 15–30 min on a g5).

name: "trtllm-mistral7b-fp16-precompiled-g5-4xlarge"
model:
  model_id: "mistralai/Mistral-7B-Instruct-v0.3"
  model_data_url: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/model.tar.gz"
  quantization: "fp16"
  backend: "trtllm"
endpoint:
  instance_type: "ml.g5.4xlarge"
  instance_cost_per_hour: 2.03
  container_startup_timeout: 3600

Helper scripts:

  • scripts/trtllm_precompile_train.sh — container-side compile + bundle script
  • scripts/create_trtllm_precompile_training_job.py — submit a SageMaker Training Job that outputs model.tar.gz

See docs/trtllm-precompiled-engine.md for the full workflow.


Teardown & Cleanup

Endpoints cost money while running. The runner tears down each endpoint automatically in a finally block. If a run is interrupted before that cleanup completes (Ctrl+C, SSH drop, OOM kill), endpoints may be orphaned. Use the commands below to sweep and verify:

# Delete ALL llm-bench-* endpoints in your account
make clean

# Verify nothing is still running
aws sagemaker list-endpoints --region us-east-1 \
  --query 'Endpoints[?starts_with(EndpointName, `llm-bench-`)].{Name:EndpointName,Status:EndpointStatus}' \
  --output table

Full resource teardown:

make clean                        # 1. endpoints
make destroy-infra                # 2. Terraform (CloudWatch alarms)
aws s3 rb s3://<your-bucket> --force      # 3. S3 buckets
aws ecr delete-repository --repository-name llm-inference-optimizer-byoc --force --region us-east-1   # 4. ECR repository

Known Incompatibilities

Gated Meta Llama configs (vllm_llama31_8b_fp16, vllm_llama32_1b_fp16) — These configs require HF_TOKEN to be set in your .env file. Without it, the endpoint will fail to download the gated model weights.

TRT-LLM FP16 on ml.g5.2xlarge without precompiled artifacts — The direct startup path times out before reaching InService. Use a pre-compiled engine via model_data_url instead (see Pre-compiled TRT-LLM Engine).

llama.cpp BYOC on ml.m5.xlarge for Mistral 7B — The run executed, but warmup failed 10/10 requests with container timeout / backend-not-ready errors. This is the current lower-bound failure point in the CPU sizing sweep.

llama.cpp BYOC on ml.m5.2xlarge / ml.m5.4xlarge for Mistral 7B — These sizes can produce metric rows, but are only realistically healthy at c=1. Higher concurrency levels degrade sharply (98%+ error rate by c=5, full collapse by c=10).

llama.cpp TTFT instrumentation — CPU llama.cpp rows currently record ttft_p95=0.0, which means TTFT-specific charts and radar comparisons across backends are not trustworthy yet. E2E latency, throughput, and error-rate fields are still useful.

GPTQ and DJL-LMI version: vllm_gptq_int4 requires DJL-LMI 0.36.0+ (the default). Older versions (0.32.0 / vLLM 0.7.3) hit a partial_rotary_factor failure that has since been resolved.


Raw Data

Path Description
results/audited-2026-04-07-current/all_metrics.json Merged metric bundle (all backends, all concurrency levels)
results/audited-2026-04-07-current/summary.json Bundle-level summary with chart paths
results/audited-2026-04-07-current/audit_summary.json Audit outcome, warnings, failed configs, and notes
results/audited-2026-04-07-current/canonical_selection.json Source-directory mapping for each config
results/audited-2026-04-07-current/*.html Interactive Plotly charts (Pareto, latency, throughput, cost, MBU, radar, heatmap, bubble)
results/audited-2026-04-07-current/benchmark_report.html Single merged HTML report with all charts
results/archive-2026-04-07/ Historical raw artifact sets and prior benchmark runs
results/archive-2026-04-07/sample/ Synthetic demo data for offline dashboard testing
reports/latest-benchmark-report.md Methodology and status report
reports/public-vllm-validated-2026-04-06.md Public-vLLM-only validated baseline report
