SageMaker LLM Inference Optimizer

A production-grade benchmarking framework that deploys and evaluates LLM inference backends on Amazon SageMaker to find the optimal serving configuration by latency, throughput, and cost. It automates the full lifecycle — deploy, load test, collect metrics, visualize, and recommend — so you can make data-driven decisions about how to serve your models in production.

Results

Browse the full interactive report: results/audited-2026-04-07-current/benchmark_report.html

Raw metrics & charts: results/audited-2026-04-07-current/

Detailed write-ups: reports/latest-benchmark-report.md | reports/public-vllm-validated-2026-04-06.md

14 configurations benchmarked across 3 backends (vLLM, TensorRT-LLM, llama.cpp), 5 model variants, GPU and CPU instance types. 54 metric rows collected, 50 with finite cost.

Best Observed Points

Metric	Best observed point	Notes
Lowest cost/token overall	`vllm-llama32-1b-fp16-g5 @ c=25` — $0.37/M tokens	1B GPU rerun; not directly comparable to 7B-class workloads
Highest throughput overall	`vllm-llama32-1b-fp16-g5 @ c=25` — 1057.9 tok/s	Fastest audited point in the current bundle
Lowest overall P95 E2E latency	`vllm-tinyllama-1.1b-fp16-g5-test @ c=1` — 579.7 ms	Smoke config only; not apples-to-apples with the full sweeps
Lowest cost/token among 7B-class configs	`trtllm-mistral7b-awq-int4-g5 @ c=10` — $0.86/M tokens	Best 7B-class cost point in a full 4-level sweep
Best comparable FP16 7B point	`vllm-mistral7b-fp16-g5 @ c=10` — 192.3 tok/s, 9077.7 ms P95, $2.04/M	Best completed FP16 7B point on the lower-cost GPU instance
Best CPU low-error point	`llamacpp-llama32-1b-q4km-m54 @ c=1` — 72.5 tok/s, 5844.5 ms P95, $3.53/M	`c=1,5,10` were clean; `c=25` degraded badly
Best Mistral 7B CPU point	`llamacpp-mistral-q4km-m54 @ c=1` — 11.7 tok/s, 31767.9 ms P95, $21.79/M	`ml.m5.xlarge` failed warmup entirely; `m5.2xlarge` / `m5.4xlarge` collapse above `c=1`

Best Successful / Most Useful Point Per Audited Config

Config	Best conc.	Throughput (tok/s)	E2E P95 (ms)	$/M tokens	Notes
`vllm-llama32-1b-fp16-g5`	25	1057.86	2024.30	0.37	Best overall throughput + cost in the current audited bundle
`trtllm-mistral7b-awq-int4-g5`	10	493.13	3403.34	0.86	Best overall 7B-class cost + throughput
`vllm-mistral7b-gptq-int4-g5`	25	436.67	3481.04	0.90	GPTQ-INT4 on DJL-LMI 0.36.0
`vllm-llama31-8b-fp16-g5`	25	225.05	9642.77	1.74	Gated model (requires `HF_TOKEN`)
`vllm-mistral7b-fp16-g5`	10	192.27	9077.69	2.04	Best completed FP16 result on `ml.g5.xlarge`
`trtllm-mistral7b-fp16-g5`	10	191.99	8628.98	2.20	Precompiled engine on `ml.g5.2xlarge`
`trtllm-mistral7b-fp16-g5-4xlarge`	10	190.55	8748.68	2.96	Similar throughput to vLLM FP16, but on a pricier instance
`vllm-tinyllama-1.1b-fp16-g5-test`	2	173.19	1705.46	2.26	Smoke config only: `c=[1,2]`, 10 measured requests, `max_output_tokens=64`
`vllm-qwen25-7b-awq-int4-g5`	10	114.10	19012.41	3.43	Stable but still weak value among the 7B GPU sweeps
`llamacpp-llama32-1b-q4km-m54`	1	72.47	5844.47	3.53	CPU run was clean through `c=10`, then degraded hard at `c=25`
`llamacpp-qwen15b-q4km-m54`	5	60.29	33386.76	4.24	`c=1,5,10` were clean; `c=25` had 82% errors
`llamacpp-tinyllama-q4km-m54`	1	52.66	6138.54	4.85	`c=1,5,10` were clean; `c=25` had 15% errors
`llamacpp-mistral-q4km-m54`	1	11.73	31767.88	21.79	Only `c=1` was healthy; `c=5` had 98% errors, `c=10,25` fully collapsed
`llamacpp-mistral-q4km-m52`	1	6.32	57593.96	20.22	Only `c=1` was usable; `c=5` had 99% errors, `c=10,25` fully collapsed

Comparison Caveats

The smoke config is not a fair apples-to-apples comparison. vllm-tinyllama-1.1b-fp16-g5-test intentionally uses c=[1,2], only 10 measured requests, and max_output_tokens=64. Treat it as a fast health-check datapoint.
Most completed full sweeps use sharegpt_subset, c=[1,5,10,25], and 100 measured requests per level.
Cross-backend and cross-instance comparisons are directional, not perfectly controlled. ml.g5.xlarge, ml.g5.2xlarge, ml.g5.4xlarge, and ml.m5.* sit at different cost/latency trade-offs.
llama.cpp TTFT is currently instrumentation-limited. CPU rows record ttft_p95=0.0, so TTFT-specific and radar-style cross-backend comparisons are not trustworthy yet.
Several CPU runs only stay healthy at low or medium concurrency. Do not treat their c=25 rows as production-ready just because a metric file exists.
If you want the cleanest public-model apples-to-apples slice, use reports/public-vllm-validated-2026-04-06.md.

How It Works

This tool answers a simple question: "What's the best way to serve my LLM in production?"

It works in four stages:

Deploy — For each configuration (backend + quantization + instance type), the framework provisions a SageMaker real-time endpoint using either a managed DJL container (vLLM / TensorRT-LLM) or a custom BYOC container (llama.cpp). Endpoints are created inside a Python context manager that guarantees teardown even on crashes or interrupts.
Benchmark — Once the endpoint is healthy, the load generator fires async HTTP requests at increasing concurrency levels (e.g., 1, 5, 10, 25 concurrent users). Each request captures timestamps for time-to-first-token (TTFT), inter-token latency (ITL), and end-to-end latency. A brief warmup phase stabilizes JIT compilation and KV caches before measured requests begin.
Analyze — Raw per-request traces are aggregated into percentile metrics (P50/P95/P99), throughput (tokens/sec), cost per million tokens, and Model Bandwidth Utilization (MBU). A Pareto frontier algorithm identifies the configurations that offer the best trade-offs between cost and latency — no other config is both cheaper AND faster.
Visualize — Eight interactive Plotly charts are generated (Pareto scatter, latency distributions, throughput scaling, cost bars, MBU bars, radar comparison, throughput heatmap, cost-performance bubble). A Streamlit dashboard lets you explore results interactively, and an optional MLflow integration logs all metrics for experiment tracking.
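
The Pareto step in the Analyze stage can be illustrated with a minimal sketch. This is not the code in src/analysis/pareto.py: the `Point` type and the `dominates` and `pareto_frontier` functions are hypothetical names chosen for illustration, and the sample rows reuse cost and P95 figures from the results table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    name: str
    cost_per_m_tokens: float  # USD per million tokens, lower is better
    p95_latency_ms: float     # end-to-end P95 latency, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = (a.cost_per_m_tokens <= b.cost_per_m_tokens
                and a.p95_latency_ms <= b.p95_latency_ms)
    strictly_better = (a.cost_per_m_tokens < b.cost_per_m_tokens
                       or a.p95_latency_ms < b.p95_latency_ms)
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates (O(n^2), fine for small n)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Sample rows copied from the per-config results table.
points = [
    Point("vllm-llama32-1b-fp16-g5", 0.37, 2024.30),
    Point("trtllm-mistral7b-awq-int4-g5", 0.86, 3403.34),
    Point("vllm-mistral7b-gptq-int4-g5", 0.90, 3481.04),
    Point("vllm-tinyllama-1.1b-fp16-g5-test", 2.26, 1705.46),
    Point("vllm-mistral7b-fp16-g5", 2.04, 9077.69),
]
frontier = pareto_frontier(points)
print([p.name for p in frontier])
```

With these five rows, only the 1B vLLM config and the smoke config survive, which is a reminder that the frontier is only meaningful within a comparable slice (for example, 7B-class configs from full sweeps).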

The entire pipeline is config-driven: each YAML file in configs/ defines a complete experiment (model, backend, instance, benchmark parameters). Add a new YAML, run make benchmark-single, and the framework handles the rest.
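
As a rough illustration of how an experiment file could layer on top of shared defaults, here is a minimal sketch of a recursive merge. It uses plain dictionaries in place of parsed YAML so it runs without extra dependencies. The `deep_merge` helper and the `benchmark` / `concurrency_levels` keys are assumptions made for this example; the project's actual loader (Pydantic + YAML merge in src/config.py) may work differently.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins; nested dicts are merged recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for a parsed configs/base.yaml (key names below "benchmark" are assumed).
base = {
    "model": {"model_id": "mistralai/Mistral-7B-Instruct-v0.3", "quantization": "fp16"},
    "benchmark": {"concurrency_levels": [1, 5, 10, 25]},
}
# Stand-in for one experiment file that only states what differs from the base.
experiment = {
    "name": "trtllm-mistral7b-fp16-g5-4xlarge",
    "model": {"backend": "trtllm"},
    "endpoint": {"instance_type": "ml.g5.4xlarge", "instance_cost_per_hour": 2.03},
}
config = deep_merge(base, experiment)
print(config["model"])
```

The merge copies its inputs, so the shared defaults are never mutated when several experiment files are loaded in one run.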

Why This Project

Choosing the right LLM serving stack involves hard trade-offs:

vLLM gives great throughput but requires GPU instances
TensorRT-LLM can be faster after compilation but has cold-start overhead
INT4 quantization cuts cost but may affect output quality
llama.cpp on CPU is cheap per instance-hour but degrades at high concurrency

This tool runs controlled experiments across all of these and gives you the numbers to decide. No guessing.

Backends & Configurations

All results live in results/audited-2026-04-07-current/. Historical reruns are archived under results/archive-2026-04-07/.

GPU Backends

Config / run	Backend	Model	Quantization	Instance	GPU	$/hr	Final status
`vllm_llama32_1b_fp16`	vLLM via DJL-LMI	Llama-3.2-1B-Instruct	FP16	ml.g5.xlarge	A10G 24 GB	$1.41	Complete — best `c=25`, 1057.86 tok/s, $0.37/M
`vllm_llama31_8b_fp16`	vLLM via DJL-LMI	Llama-3.1-8B-Instruct	FP16	ml.g5.xlarge	A10G 24 GB	$1.41	Complete — best `c=25`, 225.05 tok/s, $1.74/M
`vllm_fp16`	vLLM via DJL-LMI	Mistral-7B-Instruct-v0.3	FP16	ml.g5.xlarge	A10G 24 GB	$1.41	Complete — best `c=10`, 192.27 tok/s, $2.04/M
`vllm_awq_int4`	vLLM via DJL-LMI	Qwen2.5-7B-Instruct-AWQ	AWQ-INT4	ml.g5.xlarge	A10G 24 GB	$1.41	Complete — best `c=10`, 114.10 tok/s, $3.43/M
`test_single`	vLLM via DJL-LMI	TinyLlama-1.1B-Chat-v1.0	FP16	ml.g5.xlarge	A10G 24 GB	$1.41	Smoke config — `c=[1,2]` only, best `c=2`, 173.19 tok/s, $2.26/M
`trtllm_fp16`	TensorRT-LLM via DJL	Mistral-7B-Instruct-v0.3	FP16	ml.g5.2xlarge	A10G 24 GB	$1.52	Complete (precompiled engine) — best `c=10`, 191.99 tok/s, $2.20/M
`trtllm_fp16_g54xlarge`	TensorRT-LLM via DJL	Mistral-7B-Instruct-v0.3	FP16	ml.g5.4xlarge	A10G 24 GB	$2.03	Complete — best `c=10`, 190.55 tok/s, $2.96/M
`trtllm_awq_int4`	TensorRT-LLM via DJL	Mistral-7B-Instruct-v0.3	AWQ-INT4	ml.g5.2xlarge	A10G 24 GB	$1.52	Complete — best `c=10`, 493.13 tok/s, $0.86/M
`vllm_gptq_int4`	vLLM via DJL-LMI	Mistral-7B-GPTQ	GPTQ-INT4	ml.g5.xlarge	A10G 24 GB	$1.41	Complete — best `c=25`, 436.67 tok/s, $0.90/M

CPU Backends

Config / run	Backend	Model	Quantization	Instance	Compute	$/hr	Final status
`llamacpp-mistral-q4km-m5`	llama.cpp (BYOC)	Mistral-7B GGUF Q4_K_M	GGUF Q4_K_M	ml.m5.xlarge	CPU only	$0.23	Failed — endpoint launched, but warmup failed 10/10 requests; excluded from the canonical metric bundle
`llamacpp-mistral-q4km-m52`	llama.cpp (BYOC)	Mistral-7B GGUF Q4_K_M	GGUF Q4_K_M	ml.m5.2xlarge	CPU only	$0.46	Executed, limited viability — `c=1` usable (6.32 tok/s, 57593.96 ms P95, $20.22/M); `c=5` had 99% errors; `c=10,25` had no successful requests
`llamacpp-mistral-q4km-m54`	llama.cpp (BYOC)	Mistral-7B GGUF Q4_K_M	GGUF Q4_K_M	ml.m5.4xlarge	CPU only	$0.92	Executed, limited viability — `c=1` usable (11.73 tok/s, 31767.88 ms P95, $21.79/M); `c=5` had 98% errors; `c=10,25` had no successful requests
`llamacpp_tinyllama_11b_q4km_m54`	llama.cpp (BYOC)	TinyLlama-1.1B-Chat-v1.0 GGUF	GGUF Q4_K_M	ml.m5.4xlarge	CPU only	$0.92	Executed, low/mid concurrency clean — `c=1,5,10` were error-free; best point `c=1`, 52.66 tok/s, $4.85/M; `c=25` had 15% errors
`llamacpp_llama32_1b_q4km_m54`	llama.cpp (BYOC)	Llama-3.2-1B-Instruct GGUF	GGUF Q4_K_M	ml.m5.4xlarge	CPU only	$0.92	Executed, low/mid concurrency clean — `c=1,5,10` were error-free; best point `c=1`, 72.47 tok/s, $3.53/M; `c=25` had 76% errors
`llamacpp_qwen25_15b_q4km_m54`	llama.cpp (BYOC)	Qwen2.5-1.5B-Instruct GGUF	GGUF Q4_K_M	ml.m5.4xlarge	CPU only	$0.92	Executed, low/mid concurrency clean — `c=1,5,10` were error-free; best point `c=5`, 60.29 tok/s, $4.24/M; `c=25` had 82% errors

Comparison notes:

CPU and GPU results are useful for different operating points and are not directly comparable.

Most full configs use sharegpt_subset, c=[1,5,10,25], and 100 measured requests.

The smoke config does not share that sweep. test_single uses c=[1,2], 10 measured requests, and max_output_tokens=64, so treat it as a low-cost health check rather than a strict peer in the comparison set.

llama.cpp TTFT is currently missing from the metric stream. CPU rows show ttft_p95=0.0, so TTFT-specific charts and radar comparisons should be read with caution.

Metrics Collected

Each benchmark run captures per-request traces and aggregates them into:

Metric	What it measures	Why it matters
TTFT (P50/P95/P99)	Time to first token	User-perceived responsiveness
ITL (P50/P95/P99)	Inter-token latency	Smoothness of streaming output
E2E Latency (P50/P95/P99)	Full request latency	SLA compliance
Throughput	Tokens/sec, Requests/sec	Capacity planning
Cost	$/million output tokens	Budget optimization
MBU	Model Bandwidth Utilization	Hardware efficiency (are you wasting GPU memory bandwidth?)
Error Rate	Failed / total requests	Reliability under load
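
The cost column can be reproduced from the hourly instance price and sustained throughput. The sketch below is an assumption about the formula rather than a copy of src/analysis/cost_calculator.py, but it does reproduce the published $/M figures for the rows checked here. The function names are hypothetical, and returning infinity for zero throughput is a guess at how rows without a finite cost arise.

```python
def cost_per_million_tokens(instance_cost_per_hour: float, tokens_per_second: float) -> float:
    """USD per one million output tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        return float("inf")  # no successful tokens means no finite cost
    cost_per_second = instance_cost_per_hour / 3600.0
    return cost_per_second / tokens_per_second * 1_000_000


def error_rate(failed: int, total: int) -> float:
    """Failed requests as a fraction of all requests (0.0 when nothing was sent)."""
    return failed / total if total else 0.0


# Rows from the tables above: ($/hr, tok/s) -> $/M tokens
print(round(cost_per_million_tokens(1.41, 1057.86), 2))  # vllm-llama32-1b-fp16-g5
print(round(cost_per_million_tokens(1.52, 493.13), 2))   # trtllm-mistral7b-awq-int4-g5
print(round(cost_per_million_tokens(0.92, 11.73), 2))    # llamacpp-mistral-q4km-m54
```

These three calls give 0.37, 0.86, and 21.79, matching the table. The formula makes explicit why a low hourly price does not imply a low cost per token: the CPU rows pay less per hour but produce far fewer tokens per second.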

Architecture

                        ┌──────────────────────────────────────────────┐
                        │              YAML Configs                     │
                        │  (backend, quantization, instance, params)    │
                        └──────────────────┬───────────────────────────┘
                                           │
                                           ▼
┌──────────────────┐    ┌──────────────────────────────────┐    ┌──────────────────┐
│                  │    │        Endpoint Manager           │    │                  │
│   Config Loader  │───▶│  Deploy ──▶ Health Check ──▶ Yield │───▶│   SageMaker      │
│   (Pydantic +    │    │  (context manager with finally    │    │   Endpoint       │
│    YAML merge)   │    │   guarantees teardown)            │    │   (InService)    │
│                  │    └──────────────────────────────────┘    └────────┬─────────┘
└──────────────────┘                                                    │
                                                                        ▼
┌──────────────────┐    ┌──────────────────────────────────┐    ┌──────────────────┐
│                  │    │       Metrics Collector           │    │                  │
│  Pareto Analysis │◀───│  TTFT, ITL, E2E percentiles      │◀───│  Load Generator  │
│  + Plotly Charts │    │  Throughput, Cost, MBU            │    │  (async, bounded │
│  + Streamlit     │    │                                  │    │   concurrency)   │
│                  │    └──────────────────────────────────┘    └──────────────────┘
└───────┬──────────┘
        │
        ▼
┌──────────────────┐    ┌──────────────────────────────────┐
│  results/        │    │  MLflow Tracking (optional)       │
│  - JSON metrics  │    │  SageMaker Model Registry         │
│  - HTML charts   │    │  (registers Pareto-optimal        │
│                  │    │   configs for prod deployment)    │
└──────────────────┘    └──────────────────────────────────┘
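
The "async, bounded concurrency" box in the diagram can be sketched with an `asyncio.Semaphore`. This is a simplified illustration, not the code in src/benchmark/load_generator.py: `run_level` and `fake_send` are hypothetical names, the fake sender stands in for a real endpoint call, and TTFT/ITL capture is omitted.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_level(
    send: Callable[[str], Awaitable[int]],
    prompts: list[str],
    concurrency: int,
) -> list[dict]:
    """Send every prompt, keeping at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(prompt: str) -> dict:
        async with semaphore:
            start = time.perf_counter()
            try:
                tokens = await send(prompt)
                ok = True
            except Exception:
                # Record the failure instead of aborting the whole level.
                tokens, ok = 0, False
            elapsed_ms = (time.perf_counter() - start) * 1000
            return {"ok": ok, "tokens": tokens, "e2e_ms": elapsed_ms}

    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_send(prompt: str) -> int:
    """Hypothetical stand-in for an endpoint call; returns an output token count."""
    await asyncio.sleep(0.01)
    return len(prompt.split())


results = asyncio.run(run_level(fake_send, ["hello world"] * 20, concurrency=5))
print(sum(r["ok"] for r in results), "of", len(results), "requests succeeded")
```

Catching exceptions per request is what allows a level such as c=25 to finish with a measured error rate rather than crashing the sweep.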

Key design decisions:

Guaranteed cleanup — EndpointManager.managed_endpoint() uses a finally block + SIGTERM handler so endpoints never get orphaned
Cost safety — benchmarks exceeding $10 estimated cost are blocked unless --allow-high-cost is passed
Structured logging — JSON-formatted logs with correlation IDs via structlog for production traceability
Security hardened — IAM scoped to llm-bench-* resources, KMS encryption on S3, bandit + pip-audit in CI
Config inheritance — each YAML overrides configs/base.yaml, keeping configs DRY
Deterministic prompts — seeded RNG ensures identical inputs across runs for reproducibility
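
The cleanup pattern named in the first design decision can be sketched as a context manager. This is an illustration under stated assumptions, not the real EndpointManager: the `create` and `delete` callables are hypothetical stand-ins for the SageMaker calls, and the SIGTERM handling shown here is one common way to make `finally` blocks run on termination.

```python
import signal
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


def _raise_system_exit(signum, frame):
    # Turning SIGTERM into an exception lets `finally` blocks run during shutdown.
    raise SystemExit(128 + signum)


@contextmanager
def managed_endpoint(
    name: str,
    create: Callable[[str], None],
    delete: Callable[[str], None],
) -> Iterator[str]:
    """Create an endpoint, yield its name, and always attempt deletion afterwards."""
    previous = None
    # Signal handlers can only be installed from the main thread.
    in_main_thread = threading.current_thread() is threading.main_thread()
    if in_main_thread:
        previous = signal.signal(signal.SIGTERM, _raise_system_exit)
    try:
        create(name)
        yield name
    finally:
        try:
            # Runs even if create() failed partway, so partial resources are swept too.
            delete(name)
        finally:
            if in_main_thread and previous is not None:
                signal.signal(signal.SIGTERM, previous)


calls = []
try:
    with managed_endpoint(
        "llm-bench-demo",
        create=lambda n: calls.append(("create", n)),
        delete=lambda n: calls.append(("delete", n)),
    ):
        raise RuntimeError("benchmark crashed")
except RuntimeError:
    pass
print(calls)
```

A pattern like this cannot help when the process receives SIGKILL (for example from the kernel OOM killer), because that signal cannot be caught. That gap is why a separate sweep such as `make clean` is still useful.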

Project Structure

sagemaker-llm-inference-optimizer/
│
├── src/
│   ├── config.py                  # Pydantic config system (YAML + .env)
│   ├── logging_config.py          # Structured logging setup (structlog + JSON)
│   ├── deploy/
│   │   ├── base.py               # Abstract deployer interface + shared helpers
│   │   ├── vllm_deployer.py      # vLLM on SageMaker LMI (DJL container)
│   │   ├── trtllm_deployer.py    # TensorRT-LLM (extends vLLM deployer)
│   │   ├── llamacpp_deployer.py  # Custom BYOC container on CPU
│   │   ├── endpoint_manager.py   # Lifecycle manager + cleanup guarantees
│   │   └── container_utils.py    # Image URI resolution helpers
│   ├── benchmark/
│   │   ├── runner.py             # CLI entry-point + orchestration
│   │   ├── load_generator.py     # Async concurrent request engine
│   │   ├── metrics_collector.py  # Percentile aggregation + cost calc
│   │   └── prompts.py            # Prompt datasets (ShareGPT-like, synthetic)
│   ├── analysis/
│   │   ├── pareto.py             # 2D + N-dimensional Pareto frontier
│   │   ├── cost_calculator.py    # $/token projections
│   │   ├── mbu.py                # Model bandwidth utilization estimation
│   │   └── visualizations.py     # 8 Plotly chart generators + HTML report
│   ├── tracking/
│   │   ├── mlflow_logger.py      # MLflow integration (optional)
│   │   └── model_registry.py     # SageMaker Model Registry
│   └── dashboard/
│       └── app.py                # Streamlit interactive dashboard
│
├── configs/                       # Benchmark YAML configurations
│   ├── base.yaml                 # Shared defaults (model, benchmark params)
│   ├── vllm_fp16.yaml
│   ├── vllm_awq_int4.yaml
│   ├── vllm_gptq_int4.yaml
│   ├── vllm_llama32_1b_fp16.yaml
│   ├── vllm_llama31_8b_fp16.yaml
│   ├── trtllm_fp16.yaml
│   ├── trtllm_fp16_g54xlarge.yaml
│   ├── trtllm_awq_int4.yaml
│   ├── llamacpp_gguf_q4km.yaml
│   ├── llamacpp_llama32_1b_q4km_m54.yaml
│   ├── llamacpp_qwen25_15b_q4km_m54.yaml
│   ├── llamacpp_tinyllama_11b_q4km_m54.yaml
│   ├── test_single.yaml          # Minimal smoke-test config (fast, low cost)
│   ├── environments/             # Environment-specific overrides (dev/staging/prod)
│   ├── full_gpu/                 # Preset sweep: vLLM + TRT-LLM on ml.g5
│   │   ├── base.yaml
│   │   └── *.yaml
│   └── vllm_only/                # Preset sweep: vLLM-only subset
│       ├── base.yaml
│       └── *.yaml
│
├── docs/
│   ├── images/                   # Chart PNGs from benchmark runs
│   ├── EXPLAINED.md              # Deep-dive: design decisions and trade-offs
│   ├── RUNBOOK.md                # Production operations runbook
│   └── trtllm-precompiled-engine.md
│
├── tests/                         # Pytest suite (70 tests, moto-mocked AWS)
├── infrastructure/terraform/      # IaC for S3, ECR, IAM, CloudWatch alarms
├── scripts/                       # AWS bootstrap + container build helpers
├── reports/                       # Benchmark report with full result tables
├── environment.yml                # Conda environment definition
├── Makefile                       # All commands (conda-based)
├── Dockerfile.llamacpp            # Multi-stage build for llama.cpp BYOC
└── .github/workflows/             # CI (lint + type check + test) + benchmark dispatch

Quick Start

1. Create the conda environment

make env
conda activate sagemaker-llm-optimizer

2. Run lint and tests (no AWS needed)

make lint
make test

3. Configure AWS

cp .env.example .env

Edit .env with your values:

AWS_DEFAULT_REGION=us-east-1
SAGEMAKER_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerBenchmarkRole
SAGEMAKER_S3_BUCKET=my-benchmark-bucket
ECR_REPO_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/llama-cpp-server  # only for llama.cpp
HF_TOKEN=hf_...  # only for gated models (Llama)

4. Bootstrap AWS resources

make setup-aws           # Creates S3 bucket, ECR repo, IAM role
make deploy-infra        # (Optional) Terraform for CloudWatch alarms

5. Validate configs (dry run, no AWS calls)

make dry-run

Output:

Validated 13 configs
- llamacpp-mistral7b-gguf-q4km-m5: backend=llamacpp, instance=ml.m5.xlarge, $/hr=0.23, concurrency=[1, 5, 10, 25]
- llamacpp-llama32-1b-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- llamacpp-qwen15b-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- llamacpp-tinyllama-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- vllm-tinyllama-1.1b-fp16-g5-test: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 2]
- trtllm-mistral7b-awq-int4-g5: backend=trtllm, instance=ml.g5.2xlarge, $/hr=1.52, concurrency=[1, 5, 10, 25]
- trtllm-mistral7b-fp16-g5: backend=trtllm, instance=ml.g5.2xlarge, $/hr=1.52, concurrency=[1, 5, 10, 25]
- trtllm-mistral7b-fp16-g5-4xlarge: backend=trtllm, instance=ml.g5.4xlarge, $/hr=2.03, concurrency=[1, 5, 10, 25]
- vllm-qwen25-7b-awq-int4-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-mistral7b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-mistral7b-gptq-int4-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-llama31-8b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-llama32-1b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]

6. Run benchmarks

# Single config
make benchmark-single CONFIG=configs/vllm_fp16.yaml

# All configs in the top-level configs/ directory
make benchmark

# Preset sweep — vLLM only (3 configs)
python -m src.benchmark.runner --config-dir configs/vllm_only

# Preset sweep — full GPU (vLLM + TRT-LLM, 5 configs)
python -m src.benchmark.runner --config-dir configs/full_gpu

# Scan all config subdirectories in one pass
python -m src.benchmark.runner --config-dir configs/ --recursive

7. Explore results

make dashboard           # Streamlit at http://localhost:8501
make sample-report       # Generate synthetic demo data (no AWS needed)

Command Reference

Command	Description
`make env`	Create the conda environment from `environment.yml`
`make env-update`	Update an existing conda environment
`make lint`	Lint (`ruff`) + type check (`mypy`)
`make test`	Run the full pytest suite
`make dry-run`	Validate all configs, print cost estimates (no AWS calls)
`make benchmark`	Run all configs in `configs/` end-to-end
`make benchmark-single CONFIG=<path>`	Run a single config
`make dashboard`	Launch Streamlit results dashboard
`make sample-report`	Generate synthetic demo data for the dashboard
`make clean`	Delete ALL `llm-bench-*` SageMaker endpoints (safety sweep)
`make build-llamacpp`	Build + push the llama.cpp BYOC container to ECR
`make setup-aws`	Bootstrap S3, ECR, IAM via helper script
`make deploy-infra`	Provision Terraform resources (monitoring, alarms)
`make destroy-infra`	Tear down Terraform resources

CLI flags

python -m src.benchmark.runner [OPTIONS]

  --config PATH                Single config YAML to run
  --config-dir PATH            Directory of config YAMLs  [default: configs/]
  --recursive                  Also scan config subdirectories (full_gpu/, vllm_only/)
  --instance-type TEXT         Override endpoint instance type
  --instance-cost-per-hour FLOAT  Override hourly cost
  --model-s3-uri TEXT          Override MODEL_S3_URI (llama.cpp/GGUF)
  --model-data-url TEXT        Override model_data_url (TRT-LLM precompiled)
  --dry-run                    Validate configs only, no AWS calls
  --allow-high-cost            Allow benchmarks with estimated cost > $10

Pre-compiled TRT-LLM Engine

If you already have a compiled TRT-LLM engine in S3, deploy it directly via model_data_url instead of compiling from Hugging Face at endpoint startup (which takes 15–30 min on a g5).

name: "trtllm-mistral7b-fp16-precompiled-g5-4xlarge"
model:
  model_id: "mistralai/Mistral-7B-Instruct-v0.3"
  model_data_url: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/model.tar.gz"
  quantization: "fp16"
  backend: "trtllm"
endpoint:
  instance_type: "ml.g5.4xlarge"
  instance_cost_per_hour: 2.03
  container_startup_timeout: 3600

Helper scripts:

scripts/trtllm_precompile_train.sh — container-side compile + bundle script
scripts/create_trtllm_precompile_training_job.py — submit a SageMaker Training Job that outputs model.tar.gz

See docs/trtllm-precompiled-engine.md for the full workflow.

Teardown & Cleanup

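Endpoints cost money while running. The runner tears down each endpoint automatically in a finally block, but that cleanup cannot run if the process is killed outright or is interrupted again during teardown. If a run is interrupted (Ctrl+C, SSH drop, OOM kill), check for orphaned endpoints: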

# Delete ALL llm-bench-* endpoints in your account
make clean

# Verify nothing is still running
aws sagemaker list-endpoints --region us-east-1 \
  --query 'Endpoints[?starts_with(EndpointName, `llm-bench-`)].{Name:EndpointName,Status:EndpointStatus}' \
  --output table
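
The same check can be scripted in Python. The sketch below uses the SageMaker `list_endpoints` paginator with its `NameContains` filter; the `list_bench_endpoints` helper and the fake client classes are hypothetical and exist only so the example runs without AWS credentials. With real credentials you would pass `boto3.client("sagemaker", region_name="us-east-1")` instead.

```python
from typing import Any


def list_bench_endpoints(sagemaker_client: Any, prefix: str = "llm-bench-") -> list[tuple[str, str]]:
    """Return (name, status) for every endpoint whose name starts with prefix."""
    found = []
    paginator = sagemaker_client.get_paginator("list_endpoints")
    # NameContains is a substring filter, so re-check the prefix client-side.
    for page in paginator.paginate(NameContains=prefix):
        for endpoint in page["Endpoints"]:
            if endpoint["EndpointName"].startswith(prefix):
                found.append((endpoint["EndpointName"], endpoint["EndpointStatus"]))
    return found


class _FakePaginator:
    """Offline stand-in that mimics the shape of a list_endpoints page."""

    def paginate(self, **kwargs):
        yield {"Endpoints": [
            {"EndpointName": "llm-bench-vllm-fp16", "EndpointStatus": "InService"},
            {"EndpointName": "other-llm-bench-copy", "EndpointStatus": "InService"},
        ]}


class _FakeClient:
    def get_paginator(self, operation_name):
        return _FakePaginator()


leftovers = list_bench_endpoints(_FakeClient())
print(leftovers)
```

An empty list means nothing matching the prefix is still running in that region.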

Full resource teardown:

make clean                        # 1. endpoints
make destroy-infra                # 2. Terraform (CloudWatch alarms)
aws s3 rb s3://<your-bucket> --force      # 3. S3 bucket
aws ecr delete-repository --repository-name llm-inference-optimizer-byoc --force --region us-east-1   # 4. ECR repository

Known Incompatibilities

Gated Meta Llama configs (vllm_llama31_8b_fp16, vllm_llama32_1b_fp16) — These configs require HF_TOKEN to be set in your .env file. Without it, the endpoint will fail to download the gated model weights.

TRT-LLM FP16 on ml.g5.2xlarge without precompiled artifacts — The direct startup path times out before reaching InService. Use a pre-compiled engine via model_data_url instead (see Pre-compiled TRT-LLM Engine).

llama.cpp BYOC on ml.m5.xlarge for Mistral 7B — The run executed, but warmup failed 10/10 requests with container timeout / backend-not-ready errors. This is the current lower-bound failure point in the CPU sizing sweep.

llama.cpp BYOC on ml.m5.2xlarge / ml.m5.4xlarge for Mistral 7B — These sizes can produce metric rows, but are only realistically healthy at c=1. Higher concurrency levels degrade sharply (98%+ error rate by c=5, full collapse by c=10).

llama.cpp TTFT instrumentation — CPU llama.cpp rows currently record ttft_p95=0.0, which means TTFT-specific charts and radar comparisons across backends are not trustworthy yet. E2E latency, throughput, and error-rate fields are still useful.

GPTQ and DJL-LMI version — vllm_gptq_int4 requires DJL-LMI 0.36.0+ (the default). Older versions (0.32.0 / vLLM 0.7.3) hit a partial_rotary_factor failure that has since been resolved.

Raw Data

Path	Description
`results/audited-2026-04-07-current/all_metrics.json`	Merged metric bundle (all backends, all concurrency levels)
`results/audited-2026-04-07-current/summary.json`	Bundle-level summary with chart paths
`results/audited-2026-04-07-current/audit_summary.json`	Audit outcome, warnings, failed configs, and notes
`results/audited-2026-04-07-current/canonical_selection.json`	Source-directory mapping for each config
`results/audited-2026-04-07-current/*.html`	Interactive Plotly charts (Pareto, latency, throughput, cost, MBU, radar, heatmap, bubble)
`results/audited-2026-04-07-current/benchmark_report.html`	Single merged HTML report with all charts
`results/archive-2026-04-07/`	Historical raw artifact sets and prior benchmark runs
`results/archive-2026-04-07/sample/`	Synthetic demo data for offline dashboard testing
`reports/latest-benchmark-report.md`	Methodology and status report
`reports/public-vllm-validated-2026-04-06.md`	Public-vLLM-only validated baseline report
