Merged
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
.venv/
.DS_Store
__pycache__/
*.pyc
.pytest_cache/
@@ -14,4 +15,5 @@ HF_TOKEN
*.tfstate.*
results/*.json
results/*.html
results/*.log
!results/.gitkeep
81 changes: 74 additions & 7 deletions README.md
@@ -49,8 +49,9 @@ This tool runs controlled experiments across all of these and gives you the numb
| `vllm_llama31_8b_fp16` | vLLM via DJL-LMI | Llama-3.1-8B-Instruct | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Tested |
| `vllm_fp16` | vLLM via DJL-LMI | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Tested |
| `vllm_awq_int4` | vLLM via DJL-LMI | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.xlarge | A10G 24 GB | $1.41 | Tested |
| `trtllm_fp16` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Retrying |
| `trtllm_awq_int4` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Retrying |
| `trtllm_fp16` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Failed |
| `trtllm_fp16_g54xlarge` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.4xlarge | A10G 24 GB | $2.03 | Tested |
| `trtllm_awq_int4` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Pending |
| `vllm_gptq_int4` | vLLM via DJL-LMI | Mistral-7B-GPTQ | GPTQ-INT4 | ml.g5.xlarge | A10G 24 GB | $1.41 | Incompatible |

#### CPU Backends
@@ -158,6 +159,7 @@ sagemaker-llm-inference-optimizer/
│ ├── vllm_awq_int4.yaml
│ ├── vllm_gptq_int4.yaml
│ ├── trtllm_fp16.yaml
│ ├── trtllm_fp16_g54xlarge.yaml
│ ├── trtllm_awq_int4.yaml
│ ├── llamacpp_gguf_q4km.yaml
│ ├── full_gpu/ # Preset sweep (vLLM + TRT-LLM on ml.g5)
@@ -177,6 +179,35 @@ sagemaker-llm-inference-optimizer/
└── .github/workflows/ # CI (lint + test) + benchmark dispatch
```

### Pre-compiled TRT-LLM engine flow

If you already have a compiled TRT-LLM engine bundle in S3, you can deploy it directly via
SageMaker `ModelDataUrl` instead of compiling from Hugging Face at endpoint startup.

Example config:

```yaml
name: "trtllm-mistral7b-fp16-precompiled-g5-4xlarge"

model:
  model_id: "mistralai/Mistral-7B-Instruct-v0.3"
  model_data_url: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/model.tar.gz"
  quantization: "fp16"
  backend: "trtllm"

endpoint:
  instance_type: "ml.g5.4xlarge"
  instance_cost_per_hour: 2.03
  container_startup_timeout: 3600
```

Notes:
- When `model.model_data_url` is set, the deployer uses SageMaker `ModelDataUrl` and does **not** auto-inject `HF_MODEL_ID`.
- The bundled model artifact should contain a `serving.properties` plus the compiled TRT-LLM repo contents.
- Helper scripts:
  - `scripts/trtllm_precompile_train.sh`: container-side compile + bundle script
  - `scripts/create_trtllm_precompile_training_job.py`: submits a SageMaker training job that produces `model.tar.gz`
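
The first note can be sketched as a small helper that builds the container definition passed to SageMaker `create_model`. This is a minimal sketch and not the repository's deployer: the function name `build_container_def` and the shape of the `model_cfg` argument are assumptions made for illustration. Only `Image`, `Environment`, and `ModelDataUrl` are real SageMaker container definition fields.

```python
from typing import Any, Dict


def build_container_def(image_uri: str, model_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Build a SageMaker container definition from a parsed `model:` section.

    Hypothetical helper: the real deployer may be structured differently.
    """
    env = dict(model_cfg.get("env_vars", {}))
    container: Dict[str, Any] = {"Image": image_uri, "Environment": env}

    model_data_url = model_cfg.get("model_data_url")
    if model_data_url:
        # Pre-compiled bundle: SageMaker downloads and unpacks model.tar.gz,
        # so HF_MODEL_ID is deliberately not injected.
        container["ModelDataUrl"] = model_data_url
    elif "model_id" in model_cfg:
        # No bundle: fall back to pulling the model from Hugging Face at startup.
        env.setdefault("HF_MODEL_ID", model_cfg["model_id"])
    return container
```

Keeping the two paths mutually exclusive avoids a container that receives both a pre-built engine and an instruction to compile from Hugging Face.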

---

## Quick Start
@@ -345,7 +376,8 @@ aws iam delete-role --role-name llm-inference-optimizer-sagemaker-exec-role
## Experimental Results

> **Last updated:** 2026-03-28
> Full tables, methodology, and issue log: **[`reports/latest-benchmark-report.md`](reports/latest-benchmark-report.md)**
> Full tables, methodology, and historical issue log: **[`reports/latest-benchmark-report.md`](reports/latest-benchmark-report.md)**
> Latest TRT-LLM `ml.g5.4xlarge` run is summarized below.

### Cross-Model Comparison (@ c=25, ml.g5.xlarge)

@@ -355,12 +387,45 @@ aws iam delete-role --role-name llm-inference-optimizer-sagemaker-exec-role
| vLLM Mistral-7B AWQ | 7B | 875.8 | 5712.4 | $0.45 | 0% |
| vLLM Llama-3.1-8B FP16 | 8B | 421.1 | 11264.4 | $0.93 | 0% |

### TRT-LLM Mistral-7B FP16 (@ ml.g5.4xlarge)

| Concurrency | Throughput (tok/s) | E2E P95 (ms) | $/M Tokens | Errors |
|------------|:------------------:|:------------:|:----------:|:------:|
| 1 | 36.5 | 7728.3 | $15.45 | 0% |
| 5 | 167.5 | 8473.2 | $3.37 | 0% |
| 10 | 190.9 | 8722.1 | $2.95 | 0% |
| 25 | **191.3** | 8718.3 | **$2.95** | 0% |

> Notes: This run required (1) wiring `ContainerStartupHealthCheckTimeoutInSeconds` into the SageMaker endpoint config and (2) removing `OPTION_MAX_MODEL_LEN` for TRT-LLM to avoid a DJL/TensorRT-LLM `max_model_len` initialization bug on Mistral.
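
Fix (1) can be illustrated with a helper that builds one `ProductionVariants` entry for `create_endpoint_config`. This is a sketch under assumptions: `build_production_variant` is a hypothetical name, and the 60 to 3600 second bound reflects the range the SageMaker API reference gives for this field, which is worth re-checking against the current documentation.

```python
from typing import Any, Dict


def build_production_variant(
    model_name: str, instance_type: str, startup_timeout_s: int
) -> Dict[str, Any]:
    """Return one ProductionVariants entry for create_endpoint_config."""
    # Assumed bound: SageMaker documents 60..3600 seconds for this field.
    if not 60 <= startup_timeout_s <= 3600:
        raise ValueError("startup timeout must be between 60 and 3600 seconds")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        # Without this key the config value never reaches SageMaker.
        "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
    }
```

The returned dict would be passed as an element of `ProductionVariants`, with `startup_timeout_s` taken from `endpoint.container_startup_timeout` in the YAML config.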

### Throughput Comparison (tok/s @ c=25)

```
vLLM Llama-3.2-1B FP16 |████████████████████████████████████████| 1687.0 tok/s
vLLM Mistral-7B AWQ |████████████████████░░░░░░░░░░░░░░░░░░░░| 875.8 tok/s
vLLM Llama-3.1-8B FP16 |██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 421.1 tok/s
TRT-LLM Mistral-7B FP16 |████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 191.3 tok/s
0 500 1000 1500 2000
```

### Cost Efficiency ($/M tokens @ c=25)

```
vLLM Llama-3.2-1B FP16 |███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| $0.23 Best
vLLM Mistral-7B AWQ |██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| $0.45
vLLM Llama-3.1-8B FP16 |████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░| $0.93
TRT-LLM Mistral-7B FP16 |████████████████████████████████████████| $2.95 Worst
$0 $0.50 $1.00 $1.50 $3.00
```
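
The $/M token figures in this section are consistent with dividing the hourly instance price by the tokens generated per hour. The sketch below assumes that formula (the tool's actual implementation is not shown in this diff); `cost_per_million_tokens` is an illustrative name.

```python
def cost_per_million_tokens(instance_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000


# Hourly prices and c=25 throughputs are taken from the tables in this README.
print(round(cost_per_million_tokens(2.03, 191.3), 2))  # TRT-LLM FP16, ml.g5.4xlarge -> 2.95
print(round(cost_per_million_tokens(1.41, 875.8), 2))  # vLLM AWQ, ml.g5.xlarge -> 0.45
```

Because cost scales inversely with throughput at a fixed hourly price, any setting that raises tok/s (such as higher concurrency) lowers $/M tokens proportionally.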

### Key Findings

1. **Smaller models win on cost** — Llama-3.2-1B at $0.23/M tokens is 4x cheaper than the 8B at $0.93/M
2. **Concurrency is the biggest cost lever**: costs drop 9-10x from c=1 to c=25 on the vLLM runs, and about 5x on TRT-LLM ($15.45 to $2.95 per M tokens)
3. **AWQ quantization works well** — Mistral-7B AWQ achieves 875 tok/s with zero errors, competitive with FP16
4. **vLLM FP16 Mistral-7B has lowest latency** — 706ms P95 at c=1, best for latency-sensitive apps
4. **vLLM FP16 Mistral-7B still has the lowest latency** — 706ms P95 at c=1, best for latency-sensitive apps
5. **TRT-LLM is not competitive on this stack** — even after fixing startup issues, Mistral-7B FP16 on ml.g5.4xlarge tops out at 191 tok/s and $2.95/M tokens, well behind vLLM on the smaller/cheaper ml.g5.xlarge
6. **vLLM consistently outperforms TRT-LLM** — on cheaper hardware (g5.xlarge vs g5.4xlarge), vLLM delivers 2-9x higher throughput

### Benchmark Status

@@ -370,16 +435,18 @@ aws iam delete-role --role-name llm-inference-optimizer-sagemaker-exec-role
| vLLM FP16 — Llama-3.1-8B (g5.xlarge) | **Completed** | 4 concurrency levels, 0% errors |
| vLLM FP16 — Mistral-7B (g5.xlarge) | **Completed** | 2 concurrency levels |
| vLLM AWQ-INT4 — Mistral-7B (g5.xlarge) | **Completed** | 4 concurrency levels, 0% errors |
| TRT-LLM FP16 — Mistral-7B (g5.2xlarge) | **Failed** | Engine compilation exceeds 1800s health check timeout |
| TRT-LLM AWQ-INT4 — Mistral-7B (g5.2xlarge) | **Failed** | Same — requires pre-compiled engine |
| TRT-LLM FP16 — Mistral-7B (g5.2xlarge) | **Failed** | `convert_checkpoint.py` was killed during startup (exit 137); endpoint never became healthy |
| TRT-LLM FP16 — Mistral-7B (g5.4xlarge) | **Completed** | 4 concurrency levels, 0% errors; required real SageMaker startup timeout + removing `OPTION_MAX_MODEL_LEN` |
| TRT-LLM AWQ-INT4 — Mistral-7B (g5.2xlarge) | **Pending** | Not rerun after FP16 fixes; likely needs larger instance or pre-compiled engine |
| llama.cpp GGUF — Mistral-7B (m5.xlarge) | **Blocked** | BYOC container needs rebuild (missing shared lib) |

### Known Incompatibilities

| Config | Issue | Recommendation |
|--------|-------|----------------|
| **vLLM GPTQ-INT4** (`TheBloke/Mistral-7B-Instruct-v0.2-GPTQ`) | Model's `config.json` missing `partial_rotary_factor` — crashes vLLM 0.7.3 in DJL-LMI 0.32.0 | Use AWQ quantization instead |
| **TRT-LLM** (Mistral-7B on g5.xlarge/2xlarge) | TRT-LLM engine compilation takes >30min, killed by SageMaker health check | Pre-compile engine + cache in S3, or use larger instance |
| **TRT-LLM** (Mistral-7B on g5.2xlarge) | `convert_checkpoint.py` is killed during startup (`exit 137`) before the endpoint becomes healthy | Use `ml.g5.4xlarge` or a pre-compiled engine |
| **TRT-LLM** (DJL 0.32.0 / TensorRT-LLM 0.12.0 with `OPTION_MAX_MODEL_LEN`) | Mistral startup can crash with `GeneralEngineConfig.__init__() got multiple values for argument 'max_model_len'` | Omit `OPTION_MAX_MODEL_LEN` for TRT-LLM on this stack |
| **llama.cpp BYOC** | `libmtmd.so.0` missing — latest llama.cpp main added multimodal dependency | Dockerfile fixed (pinned to tag b5460), awaiting rebuild |
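
For reading the `exit 137` entry: by shell convention, an exit status above 128 means the process was terminated by signal `status - 128`, so 137 corresponds to signal 9 (`SIGKILL`). That is consistent with, but not proof of, the Linux out-of-memory killer; the source only records the exit code. A small sketch for POSIX systems, with `describe_exit_code` as an illustrative name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit code using the shell's 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return "exited normally" if code == 0 else f"exited with error status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```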

### Detailed Reports
1 change: 0 additions & 1 deletion configs/trtllm_awq_int4.yaml
@@ -6,7 +6,6 @@ model:
  env_vars:
    HF_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
    OPTION_ROLLING_BATCH: "trtllm"
    OPTION_MAX_MODEL_LEN: "4096"
    OPTION_TENSOR_PARALLEL_DEGREE: "1"
    OPTION_DTYPE: "fp16"
    OPTION_QUANTIZE: "awq"
3 changes: 1 addition & 2 deletions configs/trtllm_fp16.yaml
@@ -6,11 +6,10 @@ model:
  env_vars:
    HF_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
    OPTION_ROLLING_BATCH: "trtllm"
    OPTION_MAX_MODEL_LEN: "4096"
    OPTION_TENSOR_PARALLEL_DEGREE: "1"
    OPTION_DTYPE: "fp16"

endpoint:
  instance_type: "ml.g5.2xlarge"
  instance_cost_per_hour: 1.52
  container_startup_timeout: 1800
  container_startup_timeout: 3600
15 changes: 15 additions & 0 deletions configs/trtllm_fp16_g54xlarge.yaml
@@ -0,0 +1,15 @@
name: "trtllm-mistral7b-fp16-g5-4xlarge"

model:
  quantization: "fp16"
  backend: "trtllm"
  env_vars:
    HF_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
    OPTION_ROLLING_BATCH: "trtllm"
    OPTION_TENSOR_PARALLEL_DEGREE: "1"
    OPTION_DTYPE: "fp16"

endpoint:
  instance_type: "ml.g5.4xlarge"
  instance_cost_per_hour: 2.03
  container_startup_timeout: 3600
196 changes: 196 additions & 0 deletions docs/trtllm-precompiled-engine.md
@@ -0,0 +1,196 @@
# TRT-LLM Pre-compiled Engine Guide

When started from a Hugging Face checkpoint, the TRT-LLM container builds a TensorRT engine at
startup. For a 7B model this takes 20-40 minutes, which can exceed SageMaker's container startup
health check timeout. The solution: compile once offline, upload to S3, and point the endpoint
at the cached engine.

## Prerequisites

- AWS CLI configured with your credentials
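- A GPU instance for compilation (EC2 g5.2xlarge or larger; the `ml.g5.2xlarge` endpoint run was killed with exit 137 during `convert_checkpoint.py`, so move up to g5.4xlarge if compilation is killed)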
- Your S3 bucket: `$SAGEMAKER_S3_BUCKET`
- HuggingFace token for gated models: `$HF_TOKEN`

## Step 1 — Launch an EC2 GPU Instance

```bash
# Launch a g5.2xlarge (same GPU arch as SageMaker g5 instances)
aws ec2 run-instances \
--image-id ami-0a0e5d9c7acc336f1 \
--instance-type g5.2xlarge \
--key-name your-key \
--security-group-ids sg-xxx \
--subnet-id subnet-xxx \
--block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200}}]' \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=trtllm-compile}]'
```

SSH into the instance once it's running.

## Step 2 — Pull the DJL TensorRT-LLM Container

```bash
# Get the same container image SageMaker uses
# The URI below is hardcoded for us-east-1; change the region (and registry account, if it differs) for other regions
# You can find the exact URI by running locally:
# python -c "from sagemaker import image_uris; print(image_uris.retrieve('djl-tensorrtllm', 'us-east-1', version='0.32.0'))"

IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-tensorrtllm0.16.0-cu128-full"

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
docker pull $IMAGE
```

## Step 3 — Compile the TRT-LLM Engine

### FP16 Engine

```bash
docker run --gpus all --rm \
-v /tmp/trtllm-cache:/tmp/trtllm-cache \
-e HF_TOKEN=$HF_TOKEN \
$IMAGE \
python /opt/djl/partition/trt_llm_partition.py \
--properties_dir /dev/null \
--trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
--tensor_parallel_degree 1 \
--pipeline_parallel_degree 1 \
--model_path mistralai/Mistral-7B-Instruct-v0.3 \
--dtype fp16 \
--max_input_len 4096 \
--max_num_tokens 16384
```

If the above doesn't work with `/dev/null` as properties_dir, create a properties file:

```bash
mkdir -p /tmp/trtllm-props
cat > /tmp/trtllm-props/serving.properties << 'EOF'
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.3
option.rolling_batch=trtllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.max_model_len=4096
option.max_rolling_batch_size=256
option.max_num_tokens=16384
EOF

docker run --gpus all --rm \
-v /tmp/trtllm-cache:/tmp/trtllm-cache \
-v /tmp/trtllm-props:/tmp/trtllm-props \
-e HF_TOKEN=$HF_TOKEN \
$IMAGE \
python /opt/djl/partition/trt_llm_partition.py \
--properties_dir /tmp/trtllm-props \
--trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
--tensor_parallel_degree 1 \
--pipeline_parallel_degree 1 \
--model_path mistralai/Mistral-7B-Instruct-v0.3
```

### AWQ-INT4 Engine

Same as above, but add `option.quantize=awq` to `serving.properties`:

```bash
cat > /tmp/trtllm-props/serving.properties << 'EOF'
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.3
option.rolling_batch=trtllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.quantize=awq
option.max_model_len=4096
option.max_rolling_batch_size=256
option.max_num_tokens=16384
EOF

docker run --gpus all --rm \
-v /tmp/trtllm-cache:/tmp/trtllm-cache \
-v /tmp/trtllm-props:/tmp/trtllm-props \
-e HF_TOKEN=$HF_TOKEN \
$IMAGE \
python /opt/djl/partition/trt_llm_partition.py \
--properties_dir /tmp/trtllm-props \
--trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-awq \
--tensor_parallel_degree 1 \
--pipeline_parallel_degree 1 \
--model_path mistralai/Mistral-7B-Instruct-v0.3
```

## Step 4 — Package and Upload to S3

```bash
# Package the compiled engine into a model.tar.gz
cd /tmp/trtllm-cache/mistral-7b-fp16
tar czf /tmp/mistral-7b-fp16-trtllm.tar.gz .

# Upload to S3
aws s3 cp /tmp/mistral-7b-fp16-trtllm.tar.gz \
s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-fp16/model.tar.gz

# Repeat for AWQ if compiled
cd /tmp/trtllm-cache/mistral-7b-awq
tar czf /tmp/mistral-7b-awq-trtllm.tar.gz .
aws s3 cp /tmp/mistral-7b-awq-trtllm.tar.gz \
s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-awq/model.tar.gz
```
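
The packaging step can also be done from Python, which makes the archive layout explicit: every entry of the engine directory sits at the archive root, matching `cd <dir> && tar czf ... .`. This is a sketch using only the standard library; `bundle_engine` is an illustrative name, and the output path is assumed to live outside the engine directory.

```python
import tarfile
from pathlib import Path
from typing import List


def bundle_engine(engine_dir: str, output_path: str) -> List[str]:
    """Write every entry of engine_dir into a gzip tarball at the archive root."""
    # List children before creating the archive so the tarball never includes itself.
    children = sorted(Path(engine_dir).iterdir())
    with tarfile.open(output_path, "w:gz") as tar:
        for child in children:
            # arcname drops the parent path so files sit at the top level.
            tar.add(child, arcname=child.name)
    # Re-open the archive and return its member names as a quick layout check.
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()
```

Checking the returned names before uploading is a cheap way to confirm that files such as `serving.properties` are at the top level rather than nested under an extra directory.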

## Step 5 — Update Config to Use Pre-compiled Engine

Modify `configs/trtllm_fp16.yaml` to point at the S3 engine:

```yaml
name: "trtllm-mistral7b-fp16-g5"

model:
  quantization: "fp16"
  backend: "trtllm"
  env_vars:
    # Point to the pre-compiled engine in S3 instead of HuggingFace model ID
    OPTION_MODEL_ID: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/"
    OPTION_ROLLING_BATCH: "trtllm"
    # OPTION_MAX_MODEL_LEN is omitted: on this stack it can crash Mistral startup
    # with a duplicate `max_model_len` argument error
    OPTION_TENSOR_PARALLEL_DEGREE: "1"
    OPTION_DTYPE: "fp16"

endpoint:
  instance_type: "ml.g5.2xlarge"
  instance_cost_per_hour: 1.52
  container_startup_timeout: 600  # Pre-compiled engine loads in <5 min
```

**Key change:** `HF_MODEL_ID` → `OPTION_MODEL_ID` pointing to the S3 path with the
pre-compiled engine. The container will download and load the engine directly instead
of compiling from scratch.

## Step 6 — Run the Benchmark

```bash
# If you use the model.tar.gz approach, model_registry.py must also pass ModelDataUrl for trtllm.
# Otherwise, use the OPTION_MODEL_ID env var approach above.

conda run -n sagemaker-llm-optimizer python -m src.benchmark.runner \
--config configs/trtllm_fp16.yaml
```

## Step 7 — Cleanup

```bash
# Terminate the EC2 compilation instance
aws ec2 terminate-instances --instance-ids i-xxx

# Optionally delete the S3 engines if no longer needed
# aws s3 rm s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/ --recursive
```

## Notes

- The compiled engine is **GPU-architecture-specific**. A g5 (A10G) engine won't
work on p4d (A100) or p5 (H100) instances. Compile on the same GPU family you'll deploy to.
- Engine compilation typically takes 20-40 min for a 7B model. Larger models take longer.
- The `model.tar.gz` approach requires the SageMaker execution role to have `s3:GetObject`
on the engine bucket (already configured if you ran `setup_aws.sh`).
- DJL 0.32.0 uses TensorRT-LLM 0.16.0. If you upgrade the DJL version, you may need to
recompile the engine.
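
The first note can be turned into a pre-deploy check. The sketch below maps the instance families named above to their GPUs and treats an engine as portable only when both sides carry the same GPU. The compute capability values (8.6, 8.0, 9.0) are NVIDIA's published figures and are not from this guide; the function names are illustrative. A matching GPU is necessary but not sufficient, since the TensorRT-LLM version must also match.

```python
# Instance family -> (GPU, CUDA compute capability).
# GPU names come from the note above; compute capabilities are an added assumption.
GPU_BY_FAMILY = {
    "g5": ("A10G", "8.6"),
    "p4d": ("A100", "8.0"),
    "p5": ("H100", "9.0"),
}


def instance_family(instance_type: str) -> str:
    """Map 'ml.g5.4xlarge' or 'g5.2xlarge' to 'g5'."""
    parts = instance_type.split(".")
    return parts[1] if parts[0] == "ml" else parts[0]


def engine_is_portable(compiled_on: str, deploy_to: str) -> bool:
    """True when both instance types carry the same GPU model."""
    src = GPU_BY_FAMILY[instance_family(compiled_on)]
    dst = GPU_BY_FAMILY[instance_family(deploy_to)]
    return src == dst


print(engine_is_portable("g5.2xlarge", "ml.g5.4xlarge"))    # True
print(engine_is_portable("g5.2xlarge", "ml.p4d.24xlarge"))  # False
```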