A production-grade benchmarking framework that deploys and evaluates LLM inference backends on Amazon SageMaker to find the optimal serving configuration by latency, throughput, and cost. It automates the full lifecycle — deploy, load test, collect metrics, visualize, and recommend — so you can make data-driven decisions about how to serve your models in production.
- Browse the full interactive report: `results/audited-2026-04-07-current/benchmark_report.html`
- Raw metrics & charts: `results/audited-2026-04-07-current/`
- Detailed write-ups: `reports/latest-benchmark-report.md` | `reports/public-vllm-validated-2026-04-06.md`
14 configurations benchmarked across 3 backends (vLLM, TensorRT-LLM, llama.cpp), 5 model variants, GPU and CPU instance types. 54 metric rows collected, 50 with finite cost.
| Metric | Best observed point | Notes |
|---|---|---|
| Lowest cost/token overall | `vllm-llama32-1b-fp16-g5` @ c=25: $0.37/M tokens | 1B GPU rerun; not directly comparable to 7B-class workloads |
| Highest throughput overall | `vllm-llama32-1b-fp16-g5` @ c=25: 1057.9 tok/s | Fastest audited point in the current bundle |
| Lowest overall P95 E2E latency | `vllm-tinyllama-1.1b-fp16-g5-test` @ c=1: 579.7 ms | Smoke config only; not apples-to-apples with the full sweeps |
| Lowest cost/token among 7B-class configs | `trtllm-mistral7b-awq-int4-g5` @ c=10: $0.86/M tokens | Best 7B-class cost point in a full 4-level sweep |
| Best comparable FP16 7B point | `vllm-mistral7b-fp16-g5` @ c=10: 192.3 tok/s, 9077.7 ms P95, $2.04/M | Best completed FP16 7B point on the lower-cost GPU instance |
| Best CPU low-error point | `llamacpp-llama32-1b-q4km-m54` @ c=1: 72.5 tok/s, 5844.5 ms P95, $3.53/M | c=1,5,10 were clean; c=25 degraded badly |
| Best Mistral 7B CPU point | `llamacpp-mistral-q4km-m54` @ c=1: 11.7 tok/s, 31767.9 ms P95, $21.79/M | ml.m5.xlarge failed warmup entirely; m5.2xlarge / m5.4xlarge collapse above c=1 |

| Config | Best conc. | Throughput (tok/s) | E2E P95 (ms) | $/M tokens | Notes |
|---|---|---|---|---|---|
| `vllm-llama32-1b-fp16-g5` | 25 | 1057.86 | 2024.30 | 0.37 | Best overall throughput + cost in the current audited bundle |
| `trtllm-mistral7b-awq-int4-g5` | 10 | 493.13 | 3403.34 | 0.86 | Best overall 7B-class cost + throughput |
| `vllm-mistral7b-gptq-int4-g5` | 25 | 436.67 | 3481.04 | 0.90 | GPTQ-INT4 on DJL-LMI 0.36.0 |
| `vllm-llama31-8b-fp16-g5` | 25 | 225.05 | 9642.77 | 1.74 | Gated model (requires HF_TOKEN) |
| `trtllm-mistral7b-fp16-g5` | 10 | 191.99 | 8628.98 | 2.20 | Precompiled engine on ml.g5.2xlarge |
| `vllm-mistral7b-fp16-g5` | 10 | 192.27 | 9077.69 | 2.04 | Best completed FP16 result on ml.g5.xlarge |
| `trtllm-mistral7b-fp16-g5-4xlarge` | 10 | 190.55 | 8748.68 | 2.96 | Similar throughput to vLLM FP16, but on a pricier instance |
| `vllm-tinyllama-1.1b-fp16-g5-test` | 2 | 173.19 | 1705.46 | 2.26 | Smoke config only: c=[1,2], 10 measured requests, max_output_tokens=64 |
| `vllm-qwen25-7b-awq-int4-g5` | 10 | 114.10 | 19012.41 | 3.43 | Stable but still weak value among the 7B GPU sweeps |
| `llamacpp-llama32-1b-q4km-m54` | 1 | 72.47 | 5844.47 | 3.53 | CPU run was clean through c=10, then degraded hard at c=25 |
| `llamacpp-qwen15b-q4km-m54` | 5 | 60.29 | 33386.76 | 4.24 | c=1,5,10 were clean; c=25 had 82% errors |
| `llamacpp-tinyllama-q4km-m54` | 1 | 52.66 | 6138.54 | 4.85 | c=1,5,10 were clean; c=25 had 15% errors |
| `llamacpp-mistral-q4km-m54` | 1 | 11.73 | 31767.88 | 21.79 | Only c=1 was healthy; c=5 had 98% errors, c=10,25 fully collapsed |
| `llamacpp-mistral-q4km-m52` | 1 | 6.32 | 57593.96 | 20.22 | Only c=1 was usable; c=5 had 99% errors, c=10,25 fully collapsed |

- The smoke config is not a fair apples-to-apples comparison. `vllm-tinyllama-1.1b-fp16-g5-test` intentionally uses `c=[1,2]`, only 10 measured requests, and `max_output_tokens=64`. Treat it as a fast health-check datapoint.
- Most completed full sweeps use `sharegpt_subset`, `c=[1,5,10,25]`, and 100 measured requests per level.
- Cross-backend and cross-instance comparisons are directional, not perfectly controlled. `ml.g5.xlarge`, `ml.g5.2xlarge`, `ml.g5.4xlarge`, and `ml.m5.*` are solving different cost/latency trade-offs.
- llama.cpp TTFT is currently instrumentation-limited. CPU rows currently record `ttft_p95=0.0`, so TTFT-specific and radar-style cross-backend comparisons are not trustworthy yet.
- Several CPU runs only stay healthy at low or medium concurrency. Do not treat their `c=25` rows as production-ready just because a metric file exists.
- If you want the cleanest public-model apples-to-apples slice, use `reports/public-vllm-validated-2026-04-06.md`.

This tool answers a simple question: "What's the best way to serve my LLM in production?"
It works in four stages:
- Deploy: For each configuration (backend + quantization + instance type), the framework provisions a SageMaker real-time endpoint using either a managed DJL container (vLLM / TensorRT-LLM) or a custom BYOC container (llama.cpp). Endpoints are created inside a Python context manager that guarantees teardown even on crashes or interrupts.
- Benchmark: Once the endpoint is healthy, the load generator fires async HTTP requests at increasing concurrency levels (e.g., 1, 5, 10, 25 concurrent users). Each request captures timestamps for time-to-first-token (TTFT), inter-token latency (ITL), and end-to-end latency. A brief warmup phase stabilizes JIT compilation and KV caches before measured requests begin.
- Analyze: Raw per-request traces are aggregated into percentile metrics (P50/P95/P99), throughput (tokens/sec), cost per million tokens, and Model Bandwidth Utilization (MBU). A Pareto frontier algorithm identifies the configurations that offer the best trade-offs between cost and latency: a configuration is on the frontier when no other configuration is both cheaper and faster.
- Visualize: Eight interactive Plotly charts are generated (Pareto scatter, latency distributions, throughput scaling, cost bars, MBU bars, radar comparison, throughput heatmap, cost-performance bubble). A Streamlit dashboard lets you explore results interactively, and an optional MLflow integration logs all metrics for experiment tracking.
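
The Pareto step in the Analyze stage can be illustrated with a minimal two-dimensional sketch. This is not the code in `src/analysis/pareto.py`; the `Point` type and the `dominates` / `pareto_frontier` helpers are hypothetical names chosen for this example. The input values are copied from the results table in this README and mix model sizes, so the output only demonstrates the algorithm, not a deployment recommendation.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost_per_m_tokens: float  # $/M tokens, lower is better
    p95_latency_ms: float     # E2E P95 in ms, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = (a.cost_per_m_tokens <= b.cost_per_m_tokens
                and a.p95_latency_ms <= b.p95_latency_ms)
    strictly_better = (a.cost_per_m_tokens < b.cost_per_m_tokens
                       or a.p95_latency_ms < b.p95_latency_ms)
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates (O(n^2), fine for a few dozen rows)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Best-concurrency rows copied from the results table in this README.
points = [
    Point("vllm-llama32-1b-fp16-g5", 0.37, 2024.30),
    Point("trtllm-mistral7b-awq-int4-g5", 0.86, 3403.34),
    Point("vllm-mistral7b-gptq-int4-g5", 0.90, 3481.04),
    Point("vllm-mistral7b-fp16-g5", 2.04, 9077.69),
    Point("vllm-tinyllama-1.1b-fp16-g5-test", 2.26, 1705.46),
]

frontier = pareto_frontier(points)
print([p.name for p in frontier])
# ['vllm-llama32-1b-fp16-g5', 'vllm-tinyllama-1.1b-fp16-g5-test']
```

The smoke config stays on the frontier because it has the lowest latency, even though it costs more per token; nothing else in this sample beats it on both axes.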
The entire pipeline is config-driven: each YAML file in configs/ defines a complete experiment (model, backend, instance, benchmark parameters). Add a new YAML, run make benchmark-single, and the framework handles the rest.
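
Because each experiment YAML is layered over the shared defaults in `configs/base.yaml`, the merge can be pictured as a recursive dictionary override. The sketch below works on already-parsed dictionaries and is only an assumption about how such a merge might behave; `deep_merge` and the `benchmark`, `concurrency_levels`, and `num_requests` keys are hypothetical, while `instance_type` and `instance_cost_per_hour` appear in the example config in this README.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins; nested dicts are merged key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not concatenated.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for a parsed base.yaml and a per-experiment YAML.
base = {
    "benchmark": {"concurrency_levels": [1, 5, 10, 25], "num_requests": 100},
    "endpoint": {"instance_type": "ml.g5.xlarge", "instance_cost_per_hour": 1.41},
}
experiment = {
    "name": "trtllm-mistral7b-fp16-g5-4xlarge",
    "endpoint": {"instance_type": "ml.g5.4xlarge", "instance_cost_per_hour": 2.03},
}

config = deep_merge(base, experiment)
print(config["endpoint"]["instance_type"])        # ml.g5.4xlarge
print(config["benchmark"]["concurrency_levels"])  # [1, 5, 10, 25]
```

Replacing lists wholesale is a deliberate choice in this sketch: a smoke config that sets `c=[1,2]` should not inherit the remaining levels from the defaults.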
Choosing the right LLM serving stack involves hard trade-offs:
- vLLM gives great throughput but requires GPU instances
- TensorRT-LLM can be faster after compilation but has cold-start overhead
- INT4 quantization cuts cost but may affect output quality
- llama.cpp on CPU is cheap but slower at high concurrency
This tool runs controlled experiments across all of these and gives you the numbers to decide. No guessing.
All results live in results/audited-2026-04-07-current/. Historical reruns are archived under results/archive-2026-04-07/.
| Config / run | Backend | Model | Quantization | Instance | GPU | $/hr | Final status |
|---|---|---|---|---|---|---|---|
| `vllm_llama32_1b_fp16` | vLLM via DJL-LMI | Llama-3.2-1B-Instruct | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Complete: best c=25, 1057.86 tok/s, $0.37/M |
| `vllm_llama31_8b_fp16` | vLLM via DJL-LMI | Llama-3.1-8B-Instruct | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Complete: best c=25, 225.05 tok/s, $1.74/M |
| `vllm_fp16` | vLLM via DJL-LMI | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Complete: best c=10, 192.27 tok/s, $2.04/M |
| `vllm_awq_int4` | vLLM via DJL-LMI | Qwen2.5-7B-Instruct-AWQ | AWQ-INT4 | ml.g5.xlarge | A10G 24 GB | $1.41 | Complete: best c=10, 114.10 tok/s, $3.43/M |
| `test_single` | vLLM via DJL-LMI | TinyLlama-1.1B-Chat-v1.0 | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Smoke config: c=[1,2] only, best c=2, 173.19 tok/s, $2.26/M |
| `trtllm_fp16` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Complete (precompiled engine): best c=10, 191.99 tok/s, $2.20/M |
| `trtllm_fp16_g54xlarge` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.4xlarge | A10G 24 GB | $2.03 | Complete: best c=10, 190.55 tok/s, $2.96/M |
| `trtllm_awq_int4` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Complete: best c=10, 493.13 tok/s, $0.86/M |
| `vllm_gptq_int4` | vLLM via DJL-LMI | Mistral-7B-GPTQ | GPTQ-INT4 | ml.g5.xlarge | A10G 24 GB | $1.41 | Complete: best c=25, 436.67 tok/s, $0.90/M |

| Config / run | Backend | Model | Quantization | Instance | Compute | $/hr | Final status |
|---|---|---|---|---|---|---|---|
| `llamacpp-mistral-q4km-m5` | llama.cpp (BYOC) | Mistral-7B GGUF Q4_K_M | GGUF Q4_K_M | ml.m5.xlarge | CPU only | $0.23 | Failed: endpoint launched, but warmup failed 10/10 requests; excluded from the canonical metric bundle |
| `llamacpp-mistral-q4km-m52` | llama.cpp (BYOC) | Mistral-7B GGUF Q4_K_M | GGUF Q4_K_M | ml.m5.2xlarge | CPU only | $0.46 | Executed, limited viability: c=1 usable (6.32 tok/s, 57593.96 ms P95, $20.22/M); c=5 had 99% errors; c=10,25 had no successful requests |
| `llamacpp-mistral-q4km-m54` | llama.cpp (BYOC) | Mistral-7B GGUF Q4_K_M | GGUF Q4_K_M | ml.m5.4xlarge | CPU only | $0.92 | Executed, limited viability: c=1 usable (11.73 tok/s, 31767.88 ms P95, $21.79/M); c=5 had 98% errors; c=10,25 had no successful requests |
| `llamacpp_tinyllama_11b_q4km_m54` | llama.cpp (BYOC) | TinyLlama-1.1B-Chat-v1.0 GGUF | GGUF Q4_K_M | ml.m5.4xlarge | CPU only | $0.92 | Executed, low/mid concurrency clean: c=1,5,10 were error-free; best point c=1, 52.66 tok/s, $4.85/M; c=25 had 15% errors |
| `llamacpp_llama32_1b_q4km_m54` | llama.cpp (BYOC) | Llama-3.2-1B-Instruct GGUF | GGUF Q4_K_M | ml.m5.4xlarge | CPU only | $0.92 | Executed, low/mid concurrency clean: c=1,5,10 were error-free; best point c=1, 72.47 tok/s, $3.53/M; c=25 had 76% errors |
| `llamacpp_qwen25_15b_q4km_m54` | llama.cpp (BYOC) | Qwen2.5-1.5B-Instruct GGUF | GGUF Q4_K_M | ml.m5.4xlarge | CPU only | $0.92 | Executed, low/mid concurrency clean: c=1,5,10 were error-free; best point c=5, 60.29 tok/s, $4.24/M; c=25 had 82% errors |

Comparison notes:
- CPU and GPU results are useful for different operating points and are not directly comparable.
- Most full configs use `sharegpt_subset`, `c=[1,5,10,25]`, and 100 measured requests.
- The smoke config does not share that sweep. `test_single` uses `c=[1,2]`, 10 measured requests, and `max_output_tokens=64`, so treat it as a low-cost health check rather than a strict peer in the comparison set.
- llama.cpp TTFT is currently missing from the metric stream. CPU rows currently show `ttft_p95=0.0`, so TTFT-specific charts and radar comparisons should be read with caution.
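
When post-processing the metric rows yourself, one way to honor these notes is to drop high-error rows and to treat a `ttft_p95` of `0.0` as missing rather than as a real measurement. The sketch below is an assumption about how that could look: only the `ttft_p95` field name comes from this README, while `error_rate`, `config`, `concurrency`, the 1% threshold, and both helper functions are hypothetical. The TTFT value on the GPU row is a made-up placeholder, not a measured result.

```python
from typing import Any, Optional


def usable_ttft_p95(row: dict[str, Any]) -> Optional[float]:
    """Return ttft_p95 in ms, or None when the value is absent or the 0.0 placeholder."""
    ttft = row.get("ttft_p95")
    if ttft is None or ttft <= 0.0:
        return None
    return float(ttft)


def healthy_rows(rows: list[dict[str, Any]], max_error_rate: float = 0.01) -> list[dict[str, Any]]:
    """Keep rows whose error rate is at or below the threshold; unknown rates are excluded."""
    return [r for r in rows if r.get("error_rate", 1.0) <= max_error_rate]


rows = [
    {"config": "llamacpp-llama32-1b-q4km-m54", "concurrency": 1, "error_rate": 0.0, "ttft_p95": 0.0},
    {"config": "llamacpp-llama32-1b-q4km-m54", "concurrency": 25, "error_rate": 0.76, "ttft_p95": 0.0},
    # ttft_p95 below is an illustrative placeholder, not a benchmark result.
    {"config": "vllm-llama32-1b-fp16-g5", "concurrency": 25, "error_rate": 0.0, "ttft_p95": 150.0},
]

healthy = healthy_rows(rows)
ttft_comparable = [r["config"] for r in healthy if usable_ttft_p95(r) is not None]
print(len(healthy))      # 2
print(ttft_comparable)   # ['vllm-llama32-1b-fp16-g5']
```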
Each benchmark run captures per-request traces and aggregates them into:
| Metric | What it measures | Why it matters |
|---|---|---|
| TTFT (P50/P95/P99) | Time to first token | User-perceived responsiveness |
| ITL (P50/P95/P99) | Inter-token latency | Smoothness of streaming output |
| E2E Latency (P50/P95/P99) | Full request latency | SLA compliance |
| Throughput | Tokens/sec, Requests/sec | Capacity planning |
| Cost | $/million output tokens | Budget optimization |
| MBU | Model Bandwidth Utilization | Hardware efficiency (are you wasting GPU memory bandwidth?) |
| Error Rate | Failed / total requests | Reliability under load |
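
Two of these aggregates are simple enough to sketch directly. The cost formula below (hourly price divided by tokens generated per hour) is an assumption about the calculation, but it reproduces the $/M figures reported in this README from the listed $/hr and tok/s values. The percentile helper uses the nearest-rank method, which may differ from the interpolation used in `metrics_collector.py`; both function names are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_million_tokens(cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars spent per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


# $/hr and tok/s taken from the results tables in this README.
print(round(cost_per_million_tokens(1.41, 1057.86), 2))  # 0.37
print(round(cost_per_million_tokens(1.52, 493.13), 2))   # 0.86

latencies_ms = [float(x) for x in range(1, 101)]  # made-up samples: 1..100 ms
print(percentile(latencies_ms, 95))               # 95.0
```

Because cost scales inversely with throughput, a pricier instance only pays off when it raises tok/s by more than its price premium.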
┌──────────────────────────────────────────────┐
│ YAML Configs │
│ (backend, quantization, instance, params) │
└──────────────────┬───────────────────────────┘
│
▼
┌──────────────────┐ ┌──────────────────────────────────┐ ┌──────────────────┐
│ │ │ Endpoint Manager │ │ │
│ Config Loader │───▶│ Deploy ──▶ Health Check ──▶ Yield │───▶│ SageMaker │
│ (Pydantic + │ │ (context manager with finally │ │ Endpoint │
│ YAML merge) │ │ guarantees teardown) │ │ (InService) │
│ │ └──────────────────────────────────┘ └────────┬─────────┘
└──────────────────┘ │
▼
┌──────────────────┐ ┌──────────────────────────────────┐ ┌──────────────────┐
│ │ │ Metrics Collector │ │ │
│ Pareto Analysis │◀───│ TTFT, ITL, E2E percentiles │◀───│ Load Generator │
│ + Plotly Charts │ │ Throughput, Cost, MBU │ │ (async, bounded │
│ + Streamlit │ │ │ │ concurrency) │
│ │ └──────────────────────────────────┘ └──────────────────┘
└───────┬──────────┘
│
▼
┌──────────────────┐ ┌──────────────────────────────────┐
│ results/ │ │ MLflow Tracking (optional) │
│ - JSON metrics │ │ SageMaker Model Registry │
│ - HTML charts │ │ (registers Pareto-optimal │
│ │ │ configs for prod deployment) │
└──────────────────┘ └──────────────────────────────────┘
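
The "async, bounded concurrency" box in the diagram can be approximated with an `asyncio.Semaphore`. This is a minimal sketch rather than the code in `load_generator.py`: `run_level` and `fake_send` are hypothetical names, and `fake_send` replaces the real HTTP call with a short sleep so the example runs without AWS.

```python
import asyncio
import time


async def run_level(send_request, prompts, concurrency):
    """Send every prompt, never allowing more than `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(prompt):
        async with semaphore:
            start = time.perf_counter()
            await send_request(prompt)
            return (time.perf_counter() - start) * 1000.0  # E2E latency in ms

    return await asyncio.gather(*(one(p) for p in prompts))


# Fake endpoint used only for this demo: it tracks the peak number of in-flight requests.
in_flight = 0
peak_in_flight = 0


async def fake_send(prompt: str) -> None:
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)  # stands in for the HTTP round trip
    in_flight -= 1


prompts = [f"prompt {i}" for i in range(20)]
latencies = asyncio.run(run_level(fake_send, prompts, concurrency=5))
print(len(latencies), peak_in_flight)  # 20 5
```

Timing starts after the semaphore is acquired, so queueing on the client side is not counted as endpoint latency; whether the real load generator makes the same choice is not stated here.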
Key design decisions:
- Guaranteed cleanup: `EndpointManager.managed_endpoint()` uses a `finally` block plus a SIGTERM handler so endpoints are deleted on normal exit, on exceptions, and on SIGTERM
- Cost safety: benchmarks exceeding $10 estimated cost are blocked unless `--allow-high-cost` is passed
- Structured logging: JSON-formatted logs with correlation IDs via structlog for production traceability
- Security hardened: IAM scoped to `llm-bench-*` resources, KMS encryption on S3, bandit + pip-audit in CI
- Config inheritance: each YAML overrides `configs/base.yaml`, keeping configs DRY
- Deterministic prompts: seeded RNG ensures identical inputs across runs for reproducibility
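
The cleanup guarantee follows a common Python pattern: a context manager whose `finally` clause deletes the resource, combined with a SIGTERM handler that turns the signal into an exception so the `finally` clause gets a chance to run. The sketch below shows the pattern only; it is not the `EndpointManager` implementation, and the `create` / `delete` callables and `install_sigterm_handler` are hypothetical stand-ins for the real SageMaker calls.

```python
import signal
from contextlib import contextmanager


def install_sigterm_handler() -> None:
    """Convert SIGTERM into SystemExit so pending `finally` blocks run (main thread only)."""
    def _handler(signum, frame):
        raise SystemExit(128 + signum)
    signal.signal(signal.SIGTERM, _handler)


@contextmanager
def managed_endpoint(create, delete, name):
    """Create an endpoint, yield its name, and always attempt deletion afterwards."""
    create(name)
    try:
        yield name
    finally:
        delete(name)  # runs on normal exit, exceptions, and KeyboardInterrupt


# Demo with in-memory stand-ins instead of real SageMaker calls.
calls = []
try:
    with managed_endpoint(
        lambda n: calls.append(("create", n)),
        lambda n: calls.append(("delete", n)),
        "llm-bench-demo",
    ) as endpoint_name:
        raise RuntimeError("simulated benchmark crash")
except RuntimeError:
    pass

print(calls)  # [('create', 'llm-bench-demo'), ('delete', 'llm-bench-demo')]
```

A process killed with SIGKILL (for example by the OOM killer) cannot run any handler, which is why a separate sweep command is still useful as a backstop.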
sagemaker-llm-inference-optimizer/
│
├── src/
│ ├── config.py # Pydantic config system (YAML + .env)
│ ├── logging_config.py # Structured logging setup (structlog + JSON)
│ ├── deploy/
│ │ ├── base.py # Abstract deployer interface + shared helpers
│ │ ├── vllm_deployer.py # vLLM on SageMaker LMI (DJL container)
│ │ ├── trtllm_deployer.py # TensorRT-LLM (extends vLLM deployer)
│ │ ├── llamacpp_deployer.py # Custom BYOC container on CPU
│ │ ├── endpoint_manager.py # Lifecycle manager + cleanup guarantees
│ │ └── container_utils.py # Image URI resolution helpers
│ ├── benchmark/
│ │ ├── runner.py # CLI entry-point + orchestration
│ │ ├── load_generator.py # Async concurrent request engine
│ │ ├── metrics_collector.py # Percentile aggregation + cost calc
│ │ └── prompts.py # Prompt datasets (ShareGPT-like, synthetic)
│ ├── analysis/
│ │ ├── pareto.py # 2D + N-dimensional Pareto frontier
│ │ ├── cost_calculator.py # $/token projections
│ │ ├── mbu.py # Model bandwidth utilization estimation
│ │ └── visualizations.py # 8 Plotly chart generators + HTML report
│ ├── tracking/
│ │ ├── mlflow_logger.py # MLflow integration (optional)
│ │ └── model_registry.py # SageMaker Model Registry
│ └── dashboard/
│ └── app.py # Streamlit interactive dashboard
│
├── configs/ # Benchmark YAML configurations
│ ├── base.yaml # Shared defaults (model, benchmark params)
│ ├── vllm_fp16.yaml
│ ├── vllm_awq_int4.yaml
│ ├── vllm_gptq_int4.yaml
│ ├── vllm_llama32_1b_fp16.yaml
│ ├── vllm_llama31_8b_fp16.yaml
│ ├── trtllm_fp16.yaml
│ ├── trtllm_fp16_g54xlarge.yaml
│ ├── trtllm_awq_int4.yaml
│ ├── llamacpp_gguf_q4km.yaml
│ ├── llamacpp_llama32_1b_q4km_m54.yaml
│ ├── llamacpp_qwen25_15b_q4km_m54.yaml
│ ├── llamacpp_tinyllama_11b_q4km_m54.yaml
│ ├── test_single.yaml # Minimal smoke-test config (fast, low cost)
│ ├── environments/ # Environment-specific overrides (dev/staging/prod)
│ ├── full_gpu/ # Preset sweep: vLLM + TRT-LLM on ml.g5
│ │ ├── base.yaml
│ │ └── *.yaml
│ └── vllm_only/ # Preset sweep: vLLM-only subset
│ ├── base.yaml
│ └── *.yaml
│
├── docs/
│ ├── images/ # Chart PNGs from benchmark runs
│ ├── EXPLAINED.md # Deep-dive: design decisions and trade-offs
│ ├── RUNBOOK.md # Production operations runbook
│ └── trtllm-precompiled-engine.md
│
├── tests/ # Pytest suite (70 tests, moto-mocked AWS)
├── infrastructure/terraform/ # IaC for S3, ECR, IAM, CloudWatch alarms
├── scripts/ # AWS bootstrap + container build helpers
├── reports/ # Benchmark report with full result tables
├── environment.yml # Conda environment definition
├── Makefile # All commands (conda-based)
├── Dockerfile.llamacpp # Multi-stage build for llama.cpp BYOC
└── .github/workflows/ # CI (lint + type check + test) + benchmark dispatch
make env
conda activate sagemaker-llm-optimizer

make lint
make test

cp .env.example .env

Edit `.env` with your values:

AWS_DEFAULT_REGION=us-east-1
SAGEMAKER_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerBenchmarkRole
SAGEMAKER_S3_BUCKET=my-benchmark-bucket
ECR_REPO_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/llama-cpp-server # only for llama.cpp
HF_TOKEN=hf_... # only for gated models (Llama)
make setup-aws # Creates S3 bucket, ECR repo, IAM role
make deploy-infra # (Optional) Terraform for CloudWatch alarms

make dry-run

Output:

Validated 13 configs
- llamacpp-mistral7b-gguf-q4km-m5: backend=llamacpp, instance=ml.m5.xlarge, $/hr=0.23, concurrency=[1, 5, 10, 25]
- llamacpp-llama32-1b-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- llamacpp-qwen15b-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- llamacpp-tinyllama-q4km-m54: backend=llamacpp, instance=ml.m5.4xlarge, $/hr=0.92, concurrency=[1, 5, 10, 25]
- vllm-tinyllama-1.1b-fp16-g5-test: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 2]
- trtllm-mistral7b-awq-int4-g5: backend=trtllm, instance=ml.g5.2xlarge, $/hr=1.52, concurrency=[1, 5, 10, 25]
- trtllm-mistral7b-fp16-g5: backend=trtllm, instance=ml.g5.2xlarge, $/hr=1.52, concurrency=[1, 5, 10, 25]
- trtllm-mistral7b-fp16-g5-4xlarge: backend=trtllm, instance=ml.g5.4xlarge, $/hr=2.03, concurrency=[1, 5, 10, 25]
- vllm-qwen25-7b-awq-int4-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-mistral7b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-mistral7b-gptq-int4-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-llama31-8b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
- vllm-llama32-1b-fp16-g5: backend=vllm, instance=ml.g5.xlarge, $/hr=1.41, concurrency=[1, 5, 10, 25]
# Single config
make benchmark-single CONFIG=configs/vllm_fp16.yaml
# All configs in the top-level configs/ directory
make benchmark
# Preset sweep — vLLM only (3 configs)
python -m src.benchmark.runner --config-dir configs/vllm_only
# Preset sweep — full GPU (vLLM + TRT-LLM, 5 configs)
python -m src.benchmark.runner --config-dir configs/full_gpu
# Scan all config subdirectories in one pass
python -m src.benchmark.runner --config-dir configs/ --recursive

make dashboard # Streamlit at http://localhost:8501
make sample-report # Generate synthetic demo data (no AWS needed)

| Command | Description |
|---|---|
| `make env` | Create the conda environment from `environment.yml` |
| `make env-update` | Update an existing conda environment |
| `make lint` | Lint (ruff) + type check (mypy) |
| `make test` | Run the full pytest suite |
| `make dry-run` | Validate all configs, print cost estimates (no AWS calls) |
| `make benchmark` | Run all configs in `configs/` end-to-end |
| `make benchmark-single CONFIG=<path>` | Run a single config |
| `make dashboard` | Launch Streamlit results dashboard |
| `make sample-report` | Generate synthetic demo data for the dashboard |
| `make clean` | Delete ALL `llm-bench-*` SageMaker endpoints (safety sweep) |
| `make build-llamacpp` | Build + push the llama.cpp BYOC container to ECR |
| `make setup-aws` | Bootstrap S3, ECR, IAM via helper script |
| `make deploy-infra` | Provision Terraform resources (monitoring, alarms) |
| `make destroy-infra` | Tear down Terraform resources |

python -m src.benchmark.runner [OPTIONS]
--config PATH Single config YAML to run
--config-dir PATH Directory of config YAMLs [default: configs/]
--recursive Also scan config subdirectories (full_gpu/, vllm_only/)
--instance-type TEXT Override endpoint instance type
--instance-cost-per-hour FLOAT Override hourly cost
--model-s3-uri TEXT Override MODEL_S3_URI (llama.cpp/GGUF)
--model-data-url TEXT Override model_data_url (TRT-LLM precompiled)
--dry-run Validate configs only, no AWS calls
--allow-high-cost Allow benchmarks with estimated cost > $10
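
The `--allow-high-cost` gate can be pictured as a pre-flight check on an estimated run cost. The $10 threshold and the flag come from this README; everything else in the sketch is an assumption, including the estimate itself (hourly price times expected endpoint hours) and the names `estimate_cost_usd`, `check_cost`, and `HighCostError`.

```python
class HighCostError(RuntimeError):
    """Raised when an estimated run cost exceeds the limit and no override was given."""


def estimate_cost_usd(instance_cost_per_hour: float, estimated_hours: float) -> float:
    """Assumed estimate: hourly instance price times expected endpoint uptime."""
    return instance_cost_per_hour * estimated_hours


def check_cost(estimated_cost_usd: float, allow_high_cost: bool = False,
               limit_usd: float = 10.0) -> None:
    """Block runs above the limit unless the caller explicitly opts in."""
    if estimated_cost_usd > limit_usd and not allow_high_cost:
        raise HighCostError(
            f"Estimated cost ${estimated_cost_usd:.2f} exceeds ${limit_usd:.2f}; "
            "pass --allow-high-cost to proceed."
        )


cheap = estimate_cost_usd(1.41, estimated_hours=1.0)   # ml.g5.xlarge for one hour
pricey = estimate_cost_usd(2.03, estimated_hours=6.0)  # ml.g5.4xlarge for six hours

check_cost(cheap)                          # passes silently
check_cost(pricey, allow_high_cost=True)   # passes because of the override

try:
    check_cost(pricey)
    blocked = False
except HighCostError as exc:
    blocked = True
    print(exc)
```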
If you already have a compiled TRT-LLM engine in S3, deploy it directly via model_data_url instead of compiling from Hugging Face at endpoint startup (which takes 15–30 min on a g5).
name: "trtllm-mistral7b-fp16-precompiled-g5-4xlarge"
model:
model_id: "mistralai/Mistral-7B-Instruct-v0.3"
model_data_url: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/model.tar.gz"
quantization: "fp16"
backend: "trtllm"
endpoint:
instance_type: "ml.g5.4xlarge"
instance_cost_per_hour: 2.03
container_startup_timeout: 3600

Helper scripts:
- `scripts/trtllm_precompile_train.sh`: container-side compile + bundle script
- `scripts/create_trtllm_precompile_training_job.py`: submits a SageMaker Training Job that outputs `model.tar.gz`

See docs/trtllm-precompiled-engine.md for the full workflow.
Endpoints cost money while running. The runner tears down each endpoint automatically in a finally block. If a run is interrupted (Ctrl+C, SSH drop, OOM kill), endpoints may be orphaned:
# Delete ALL llm-bench-* endpoints in your account
make clean
# Verify nothing is still running
aws sagemaker list-endpoints --region us-east-1 \
--query 'Endpoints[?starts_with(EndpointName, `llm-bench-`)].{Name:EndpointName,Status:EndpointStatus}' \
--output table

Full resource teardown:
make clean # 1. endpoints
make destroy-infra # 2. Terraform (CloudWatch alarms)
aws s3 rb s3://<your-bucket> --force # 3. S3 buckets
aws ecr delete-repository --repository-name llm-inference-optimizer-byoc --force --region us-east-1

Gated Meta Llama configs (`vllm_llama31_8b_fp16`, `vllm_llama32_1b_fp16`): these configs require `HF_TOKEN` to be set in your `.env` file. Without it, the endpoint will fail to download the gated model weights.
TRT-LLM FP16 on ml.g5.2xlarge without precompiled artifacts — The direct startup path times out before reaching InService. Use a pre-compiled engine via model_data_url instead (see Pre-compiled TRT-LLM Engine).
llama.cpp BYOC on ml.m5.xlarge for Mistral 7B — The run executed, but warmup failed 10/10 requests with container timeout / backend-not-ready errors. This is the current lower-bound failure point in the CPU sizing sweep.
llama.cpp BYOC on ml.m5.2xlarge / ml.m5.4xlarge for Mistral 7B — These sizes can produce metric rows, but are only realistically healthy at c=1. Higher concurrency levels degrade sharply (98%+ error rate by c=5, full collapse by c=10).
llama.cpp TTFT instrumentation — CPU llama.cpp rows currently record ttft_p95=0.0, which means TTFT-specific charts and radar comparisons across backends are not trustworthy yet. E2E latency, throughput, and error-rate fields are still useful.
GPTQ and DJL-LMI version — vllm_gptq_int4 requires DJL-LMI 0.36.0+ (the default). Older versions (0.32.0 / vLLM 0.7.3) hit a partial_rotary_factor failure that has since been resolved.

| Path | Description |
|---|---|
| `results/audited-2026-04-07-current/all_metrics.json` | Merged metric bundle (all backends, all concurrency levels) |
| `results/audited-2026-04-07-current/summary.json` | Bundle-level summary with chart paths |
| `results/audited-2026-04-07-current/audit_summary.json` | Audit outcome, warnings, failed configs, and notes |
| `results/audited-2026-04-07-current/canonical_selection.json` | Source-directory mapping for each config |
| `results/audited-2026-04-07-current/*.html` | Interactive Plotly charts (Pareto, latency, throughput, cost, MBU, radar, heatmap, bubble) |
| `results/audited-2026-04-07-current/benchmark_report.html` | Single merged HTML report with all charts |
| `results/archive-2026-04-07/` | Historical raw artifact sets and prior benchmark runs |
| `results/archive-2026-04-07/sample/` | Synthetic demo data for offline dashboard testing |
| `reports/latest-benchmark-report.md` | Methodology and status report |
| `reports/public-vllm-validated-2026-04-06.md` | Public-vLLM-only validated baseline report |