varad-more · Mar 29, 2026
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 .venv/
+.DS_Store
 __pycache__/
 *.pyc
 .pytest_cache/
@@ -14,4 +15,5 @@ HF_TOKEN
 *.tfstate.*
 results/*.json
 results/*.html
+results/*.log
 !results/.gitkeep
diff --git a/README.md b/README.md
@@ -49,8 +49,9 @@ This tool runs controlled experiments across all of these and gives you the numb
 | `vllm_llama31_8b_fp16` | vLLM via DJL-LMI | Llama-3.1-8B-Instruct | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Tested |
 | `vllm_fp16` | vLLM via DJL-LMI | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.xlarge | A10G 24 GB | $1.41 | Tested |
 | `vllm_awq_int4` | vLLM via DJL-LMI | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.xlarge | A10G 24 GB | $1.41 | Tested |
-| `trtllm_fp16` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Retrying |
-| `trtllm_awq_int4` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Retrying |
+| `trtllm_fp16` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Failed |
+| `trtllm_fp16_g54xlarge` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | FP16 | ml.g5.4xlarge | A10G 24 GB | $2.03 | Tested |
+| `trtllm_awq_int4` | TensorRT-LLM via DJL | Mistral-7B-Instruct-v0.3 | AWQ-INT4 | ml.g5.2xlarge | A10G 24 GB | $1.52 | Pending |
 | `vllm_gptq_int4` | vLLM via DJL-LMI | Mistral-7B-GPTQ | GPTQ-INT4 | ml.g5.xlarge | A10G 24 GB | $1.41 | Incompatible |
 
 #### CPU Backends
@@ -158,6 +159,7 @@ sagemaker-llm-inference-optimizer/
 │   ├── vllm_awq_int4.yaml
 │   ├── vllm_gptq_int4.yaml
 │   ├── trtllm_fp16.yaml
+│   ├── trtllm_fp16_g54xlarge.yaml
 │   ├── trtllm_awq_int4.yaml
 │   ├── llamacpp_gguf_q4km.yaml
 │   ├── full_gpu/                 # Preset sweep (vLLM + TRT-LLM on ml.g5)
@@ -177,6 +179,35 @@ sagemaker-llm-inference-optimizer/
 └── .github/workflows/             # CI (lint + test) + benchmark dispatch
 ```
 
+### Pre-compiled TRT-LLM engine flow
+
+If you already have a compiled TRT-LLM engine bundle in S3, you can deploy it directly via
+SageMaker `ModelDataUrl` instead of compiling from Hugging Face at endpoint startup.
+
+Example config:
+
+```yaml
+name: "trtllm-mistral7b-fp16-precompiled-g5-4xlarge"
+
+model:
+  model_id: "mistralai/Mistral-7B-Instruct-v0.3"
+  model_data_url: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/model.tar.gz"
+  quantization: "fp16"
+  backend: "trtllm"
+
+endpoint:
+  instance_type: "ml.g5.4xlarge"
+  instance_cost_per_hour: 2.03
+  container_startup_timeout: 3600
+```
+
+Notes:
+- When `model.model_data_url` is set, the deployer uses SageMaker `ModelDataUrl` and does **not** auto-inject `HF_MODEL_ID`.
+- The bundled model artifact should contain a `serving.properties` plus the compiled TRT-LLM repo contents.
+- Helper scripts:
+  - `scripts/trtllm_precompile_train.sh` — container-side compile + bundle script
+  - `scripts/create_trtllm_precompile_training_job.py` — submit a SageMaker training job that produces `model.tar.gz`
+
 ---
 
 ## Quick Start
@@ -345,7 +376,8 @@ aws iam delete-role --role-name llm-inference-optimizer-sagemaker-exec-role
 ## Experimental Results
 
 > **Last updated:** 2026-03-28
-> Full tables, methodology, and issue log: **[`reports/latest-benchmark-report.md`](reports/latest-benchmark-report.md)**
+> Full tables, methodology, and historical issue log: **[`reports/latest-benchmark-report.md`](reports/latest-benchmark-report.md)**
+> Latest TRT-LLM `ml.g5.4xlarge` run is summarized below.
 
 ### Cross-Model Comparison (@ c=25, ml.g5.xlarge)
 
@@ -355,12 +387,45 @@ aws iam delete-role --role-name llm-inference-optimizer-sagemaker-exec-role
 | vLLM Mistral-7B AWQ | 7B | 875.8 | 5712.4 | $0.45 | 0% |
 | vLLM Llama-3.1-8B FP16 | 8B | 421.1 | 11264.4 | $0.93 | 0% |
 
+### TRT-LLM Mistral-7B FP16 (@ ml.g5.4xlarge)
+
+| Concurrency | Throughput (tok/s) | E2E P95 (ms) | $/M Tokens | Errors |
+|------------|:------------------:|:------------:|:----------:|:------:|
+| 1 | 36.5 | 7728.3 | $15.45 | 0% |
+| 5 | 167.5 | 8473.2 | $3.37 | 0% |
+| 10 | 190.9 | 8722.1 | $2.95 | 0% |
+| 25 | **191.3** | 8718.3 | **$2.95** | 0% |
+
+> Notes: This run required (1) wiring `ContainerStartupHealthCheckTimeoutInSeconds` into the SageMaker endpoint config and (2) removing `OPTION_MAX_MODEL_LEN` for TRT-LLM to avoid a DJL/TensorRT-LLM `max_model_len` initialization bug on Mistral.
+
+### Throughput Comparison (tok/s @ c=25)
+
+```
+vLLM Llama-3.2-1B FP16  |██████████████████████████████████░░░░░░| 1687.0 tok/s
+vLLM Mistral-7B AWQ     |██████████████████░░░░░░░░░░░░░░░░░░░░░░|  875.8 tok/s
+vLLM Llama-3.1-8B FP16  |████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░|  421.1 tok/s
+TRT-LLM Mistral-7B FP16 |████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░|  191.3 tok/s
+                          0        500       1000      1500     2000
+```
+
+### Cost Efficiency ($/M tokens @ c=25)
+
+```
+vLLM Llama-3.2-1B FP16  |███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░|  $0.23  Best
+vLLM Mistral-7B AWQ     |██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░|  $0.45
+vLLM Llama-3.1-8B FP16  |████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░|  $0.93
+TRT-LLM Mistral-7B FP16 |███████████████████████████████████████░|  $2.95  Worst
+                          $0       $0.75     $1.50     $2.25    $3.00
+```
+
 ### Key Findings
 
 1. **Smaller models win on cost** — Llama-3.2-1B at $0.23/M tokens is 4x cheaper than the 8B at $0.93/M
 2. **Concurrency is the biggest cost lever** — costs drop 9-10x from c=1 to c=25 across all models
 3. **AWQ quantization works well** — Mistral-7B AWQ achieves 875 tok/s with zero errors, competitive with FP16
-4. **vLLM FP16 Mistral-7B has lowest latency** — 706ms P95 at c=1, best for latency-sensitive apps
+4. **vLLM FP16 Mistral-7B still has the lowest latency** — 706ms P95 at c=1, best for latency-sensitive apps
+5. **TRT-LLM is not competitive on this stack** — even after fixing startup issues, Mistral-7B FP16 on ml.g5.4xlarge tops out at 191 tok/s and $2.95/M tokens, well behind vLLM on the smaller/cheaper ml.g5.xlarge
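+6. **vLLM consistently outperforms TRT-LLM** — at c=25 on the cheaper ml.g5.xlarge, vLLM delivers 2.2x to 8.8x the TRT-LLM throughput measured on ml.g5.4xlarge (421.1 to 1687.0 tok/s vs 191.3 tok/s); the comparison on the same base model, Mistral-7B AWQ vs TRT-LLM FP16, is 4.6x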
 
 ### Benchmark Status
 
@@ -370,16 +435,18 @@ aws iam delete-role --role-name llm-inference-optimizer-sagemaker-exec-role
 | vLLM FP16 — Llama-3.1-8B (g5.xlarge) | **Completed** | 4 concurrency levels, 0% errors |
 | vLLM FP16 — Mistral-7B (g5.xlarge) | **Completed** | 2 concurrency levels |
 | vLLM AWQ-INT4 — Mistral-7B (g5.xlarge) | **Completed** | 4 concurrency levels, 0% errors |
-| TRT-LLM FP16 — Mistral-7B (g5.2xlarge) | **Failed** | Engine compilation exceeds 1800s health check timeout |
-| TRT-LLM AWQ-INT4 — Mistral-7B (g5.2xlarge) | **Failed** | Same — requires pre-compiled engine |
+| TRT-LLM FP16 — Mistral-7B (g5.2xlarge) | **Failed** | `convert_checkpoint.py` was killed during startup (exit 137); endpoint never became healthy |
+| TRT-LLM FP16 — Mistral-7B (g5.4xlarge) | **Completed** | 4 concurrency levels, 0% errors; required real SageMaker startup timeout + removing `OPTION_MAX_MODEL_LEN` |
+| TRT-LLM AWQ-INT4 — Mistral-7B (g5.2xlarge) | **Pending** | Not rerun after FP16 fixes; likely needs larger instance or pre-compiled engine |
 | llama.cpp GGUF — Mistral-7B (m5.xlarge) | **Blocked** | BYOC container needs rebuild (missing shared lib) |
 
 ### Known Incompatibilities
 
 | Config | Issue | Recommendation |
 |--------|-------|----------------|
 | **vLLM GPTQ-INT4** (`TheBloke/Mistral-7B-Instruct-v0.2-GPTQ`) | Model's `config.json` missing `partial_rotary_factor` — crashes vLLM 0.7.3 in DJL-LMI 0.32.0 | Use AWQ quantization instead |
-| **TRT-LLM** (Mistral-7B on g5.xlarge/2xlarge) | TRT-LLM engine compilation takes >30min, killed by SageMaker health check | Pre-compile engine + cache in S3, or use larger instance |
+| **TRT-LLM** (Mistral-7B on g5.2xlarge) | `convert_checkpoint.py` is killed during startup (`exit 137`) before the endpoint becomes healthy | Use `ml.g5.4xlarge` or a pre-compiled engine |
+| **TRT-LLM** (DJL 0.32.0 / TensorRT-LLM 0.12.0 with `OPTION_MAX_MODEL_LEN`) | Mistral startup can crash with `GeneralEngineConfig.__init__() got multiple values for argument 'max_model_len'` | Omit `OPTION_MAX_MODEL_LEN` for TRT-LLM on this stack |
 | **llama.cpp BYOC** | `libmtmd.so.0` missing — latest llama.cpp main added multimodal dependency | Dockerfile fixed (pinned to tag b5460), awaiting rebuild |
 
 ### Detailed Reports

diff --git a/configs/trtllm_awq_int4.yaml b/configs/trtllm_awq_int4.yaml
@@ -6,7 +6,6 @@ model:
   env_vars:
     HF_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
     OPTION_ROLLING_BATCH: "trtllm"
-    OPTION_MAX_MODEL_LEN: "4096"
     OPTION_TENSOR_PARALLEL_DEGREE: "1"
     OPTION_DTYPE: "fp16"
     OPTION_QUANTIZE: "awq"

diff --git a/configs/trtllm_fp16.yaml b/configs/trtllm_fp16.yaml
@@ -6,11 +6,10 @@ model:
   env_vars:
     HF_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
     OPTION_ROLLING_BATCH: "trtllm"
-    OPTION_MAX_MODEL_LEN: "4096"
     OPTION_TENSOR_PARALLEL_DEGREE: "1"
     OPTION_DTYPE: "fp16"
 
 endpoint:
   instance_type: "ml.g5.2xlarge"
   instance_cost_per_hour: 1.52
-  container_startup_timeout: 1800
+  container_startup_timeout: 3600
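The README notes in this PR say the g5.4xlarge run only worked once `container_startup_timeout` was wired into the SageMaker endpoint config as `ContainerStartupHealthCheckTimeoutInSeconds`. A minimal sketch of that mapping follows. The function name, the `AllTraffic` variant name, and the single-instance count are hypothetical and not taken from the repo's deployer; the 60 to 3600 second bounds are my understanding of the documented range for this field.

```python
from typing import Any

# Assumed bounds for ContainerStartupHealthCheckTimeoutInSeconds (seconds).
MIN_STARTUP_TIMEOUT_S = 60
MAX_STARTUP_TIMEOUT_S = 3600


def build_production_variant(model_name: str, endpoint_cfg: dict[str, Any]) -> dict[str, Any]:
    """Translate a YAML `endpoint:` block into one ProductionVariant dict."""
    variant: dict[str, Any] = {
        "VariantName": "AllTraffic",  # hypothetical variant name
        "ModelName": model_name,
        "InstanceType": endpoint_cfg["instance_type"],
        "InitialInstanceCount": 1,
    }
    timeout = endpoint_cfg.get("container_startup_timeout")
    if timeout is not None:
        timeout = int(timeout)
        if not MIN_STARTUP_TIMEOUT_S <= timeout <= MAX_STARTUP_TIMEOUT_S:
            raise ValueError(
                f"container_startup_timeout={timeout} is outside "
                f"{MIN_STARTUP_TIMEOUT_S}-{MAX_STARTUP_TIMEOUT_S} seconds"
            )
        # Without this key the endpoint falls back to the service default,
        # and the YAML setting has no effect.
        variant["ContainerStartupHealthCheckTimeoutInSeconds"] = timeout
    return variant


endpoint_cfg = {
    "instance_type": "ml.g5.2xlarge",
    "instance_cost_per_hour": 1.52,
    "container_startup_timeout": 3600,
}
variant = build_production_variant("trtllm-mistral7b-fp16", endpoint_cfg)
print(variant)
```

The resulting dict would be passed as one element of `ProductionVariants` in a boto3 `create_endpoint_config` call. Validating the range locally surfaces a bad YAML value before any AWS request is made.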
diff --git a/configs/trtllm_fp16_g54xlarge.yaml b/configs/trtllm_fp16_g54xlarge.yaml
@@ -0,0 +1,15 @@
+name: "trtllm-mistral7b-fp16-g5-4xlarge"
+
+model:
+  quantization: "fp16"
+  backend: "trtllm"
+  env_vars:
+    HF_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
+    OPTION_ROLLING_BATCH: "trtllm"
+    OPTION_TENSOR_PARALLEL_DEGREE: "1"
+    OPTION_DTYPE: "fp16"
+
+endpoint:
+  instance_type: "ml.g5.4xlarge"
+  instance_cost_per_hour: 2.03
+  container_startup_timeout: 3600
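The `instance_cost_per_hour` value in this config is what ties throughput to the README's `$/M Tokens` column. The sketch below does not reproduce the tool's own code; it assumes cost is simply the hourly price divided by tokens generated per hour, which is consistent with the g5.4xlarge figures reported in this PR. The function name is hypothetical.

```python
def cost_per_million_tokens(instance_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000


# ml.g5.4xlarge price and TRT-LLM throughputs as listed in the README table.
G5_4XLARGE_COST_PER_HOUR = 2.03
throughput_by_concurrency = {1: 36.5, 5: 167.5, 10: 190.9, 25: 191.3}

for concurrency, tok_per_s in throughput_by_concurrency.items():
    cost = cost_per_million_tokens(G5_4XLARGE_COST_PER_HOUR, tok_per_s)
    print(f"c={concurrency:>2}: ${cost:.2f}/M tokens")
```

Because cost is inversely proportional to throughput, the flat throughput between c=10 and c=25 explains why the cost stops improving at $2.95.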
diff --git a/docs/trtllm-precompiled-engine.md b/docs/trtllm-precompiled-engine.md
@@ -0,0 +1,196 @@
+# TRT-LLM Pre-compiled Engine Guide
+
+TRT-LLM builds a TensorRT engine at endpoint startup. For a 7B model this takes 20-40
+minutes and can exceed SageMaker's container startup health check timeout. The workaround:
+compile once offline, upload the engine to S3, and point the endpoint at the cached engine.
+
+## Prerequisites
+
+- AWS CLI configured with your credentials
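+- A GPU instance (EC2 g5.2xlarge or larger) for compilation. The README reports that
+  `convert_checkpoint.py` was killed (exit 137) on ml.g5.2xlarge during endpoint startup,
+  so a g5.4xlarge may be the safer choice.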
+- Your S3 bucket: `$SAGEMAKER_S3_BUCKET`
+- HuggingFace token for gated models: `$HF_TOKEN`
+
+## Step 1 — Launch an EC2 GPU Instance
+
+```bash
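+# Launch a g5.2xlarge (same GPU arch as SageMaker g5 instances).
+# AMI IDs are region-specific: replace the image ID, key name, security group,
+# and subnet with values from your own account and region.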
+aws ec2 run-instances \
+  --image-id ami-0a0e5d9c7acc336f1 \
+  --instance-type g5.2xlarge \
+  --key-name your-key \
+  --security-group-ids sg-xxx \
+  --subnet-id subnet-xxx \
+  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200}}]' \
+  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=trtllm-compile}]'
+```
+
+SSH into the instance once it's running.
+
+## Step 2 — Pull the DJL TensorRT-LLM Container
+
+```bash
+# Use the same container image that SageMaker uses.
+# The URI below hard-codes region us-east-1 and registry account 763104351884;
+# adjust them for your region.
+# To print the exact URI, run locally:
+#   python -c "from sagemaker import image_uris; print(image_uris.retrieve('djl-tensorrtllm', 'us-east-1', version='0.32.0'))"
+
+IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-tensorrtllm0.16.0-cu128-full"
+
+aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
+docker pull $IMAGE
+```
+
+## Step 3 — Compile the TRT-LLM Engine
+
+### FP16 Engine
+
+```bash
+docker run --gpus all --rm \
+  -v /tmp/trtllm-cache:/tmp/trtllm-cache \
+  -e HF_TOKEN=$HF_TOKEN \
+  $IMAGE \
+  python /opt/djl/partition/trt_llm_partition.py \
+    --properties_dir /dev/null \
+    --trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
+    --tensor_parallel_degree 1 \
+    --pipeline_parallel_degree 1 \
+    --model_path mistralai/Mistral-7B-Instruct-v0.3 \
+    --dtype fp16 \
+    --max_input_len 4096 \
+    --max_num_tokens 16384
+```
+
+If the command fails because `/dev/null` is not accepted as `properties_dir`, create a properties file and pass its directory instead:
+
+```bash
+mkdir -p /tmp/trtllm-props
+cat > /tmp/trtllm-props/serving.properties << 'EOF'
+engine=Python
+option.model_id=mistralai/Mistral-7B-Instruct-v0.3
+option.rolling_batch=trtllm
+option.tensor_parallel_degree=1
+option.dtype=fp16
+option.max_rolling_batch_size=256
+option.max_num_tokens=16384
+EOF
+
+docker run --gpus all --rm \
+  -v /tmp/trtllm-cache:/tmp/trtllm-cache \
+  -v /tmp/trtllm-props:/tmp/trtllm-props \
+  -e HF_TOKEN=$HF_TOKEN \
+  $IMAGE \
+  python /opt/djl/partition/trt_llm_partition.py \
+    --properties_dir /tmp/trtllm-props \
+    --trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
+    --tensor_parallel_degree 1 \
+    --pipeline_parallel_degree 1 \
+    --model_path mistralai/Mistral-7B-Instruct-v0.3
+```
+
+### AWQ-INT4 Engine
+
+Same as above, but add `option.quantize=awq` to `serving.properties` and write the engine to a separate output directory:
+
+```bash
+cat > /tmp/trtllm-props/serving.properties << 'EOF'
+engine=Python
+option.model_id=mistralai/Mistral-7B-Instruct-v0.3
+option.rolling_batch=trtllm
+option.tensor_parallel_degree=1
+option.dtype=fp16
+option.quantize=awq
+option.max_rolling_batch_size=256
+option.max_num_tokens=16384
+EOF
+
+docker run --gpus all --rm \
+  -v /tmp/trtllm-cache:/tmp/trtllm-cache \
+  -v /tmp/trtllm-props:/tmp/trtllm-props \
+  -e HF_TOKEN=$HF_TOKEN \
+  $IMAGE \
+  python /opt/djl/partition/trt_llm_partition.py \
+    --properties_dir /tmp/trtllm-props \
+    --trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-awq \
+    --tensor_parallel_degree 1 \
+    --pipeline_parallel_degree 1 \
+    --model_path mistralai/Mistral-7B-Instruct-v0.3
+```
+
+## Step 4 — Package and Upload to S3
+
+```bash
+# Package the compiled engine into a model.tar.gz
+cd /tmp/trtllm-cache/mistral-7b-fp16
+tar czf /tmp/mistral-7b-fp16-trtllm.tar.gz .
+
+# Upload to S3
+aws s3 cp /tmp/mistral-7b-fp16-trtllm.tar.gz \
+  s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-fp16/model.tar.gz
+
+# Repeat for AWQ if compiled
+cd /tmp/trtllm-cache/mistral-7b-awq
+tar czf /tmp/mistral-7b-awq-trtllm.tar.gz .
+aws s3 cp /tmp/mistral-7b-awq-trtllm.tar.gz \
+  s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-awq/model.tar.gz
+```
+
+## Step 5 — Update Config to Use Pre-compiled Engine
+
+Modify `configs/trtllm_fp16.yaml` to point at the S3 engine:
+
+```yaml
+name: "trtllm-mistral7b-fp16-g5"
+
+model:
+  quantization: "fp16"
+  backend: "trtllm"
+  env_vars:
+    # Point to the pre-compiled engine in S3 instead of HuggingFace model ID
+    OPTION_MODEL_ID: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/"
+    OPTION_ROLLING_BATCH: "trtllm"
+    OPTION_TENSOR_PARALLEL_DEGREE: "1"
+    OPTION_DTYPE: "fp16"
+
+endpoint:
+  instance_type: "ml.g5.2xlarge"
+  instance_cost_per_hour: 1.52
+  container_startup_timeout: 600  # Pre-compiled engine loads in <5 min
+```
+
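+**Key change:** `HF_MODEL_ID` → `OPTION_MODEL_ID` pointing to the S3 path with the
+pre-compiled engine. The container will download and load the engine directly instead
+of compiling from scratch. Note that Step 4 uploads a single `model.tar.gz`; for that
+archive, the `model.model_data_url` field described in the README's pre-compiled engine
+flow is the documented way to deploy it.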
+
+## Step 6 — Run the Benchmark
+
+```bash
+# If you packaged the engine as model.tar.gz (Step 4), set model.model_data_url in the
+# config as described in the README; the deployer then passes it as SageMaker ModelDataUrl.
+# Otherwise, use the OPTION_MODEL_ID env var approach above.
+
+conda run -n sagemaker-llm-optimizer python -m src.benchmark.runner \
+  --config configs/trtllm_fp16.yaml
+```
+
+## Step 7 — Cleanup
+
+```bash
+# Terminate the EC2 compilation instance
+aws ec2 terminate-instances --instance-ids i-xxx
+
+# Optionally delete the S3 engines if no longer needed
+# aws s3 rm s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/ --recursive
+```
+
+## Notes
+
+- The compiled engine is **GPU-architecture-specific**. A g5 (A10G) engine won't
+  work on p4d (A100) or p5 (H100) instances. Compile on the same GPU family you'll deploy to.
+- Engine compilation typically takes 20-40 min for a 7B model. Larger models take longer.
+- The `model.tar.gz` approach requires the SageMaker execution role to have `s3:GetObject`
+  on the engine bucket (already configured if you ran `setup_aws.sh`).
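+- This guide assumes DJL 0.32.0 with TensorRT-LLM 0.16.0 (per the image tag in Step 2), but
+  the README's Known Incompatibilities table lists TensorRT-LLM 0.12.0 for DJL 0.32.0. Check
+  the tag of the image you actually pull, and recompile the engine whenever the DJL or
+  TensorRT-LLM version changes.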
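The README notes in this PR say the bundled artifact should contain a `serving.properties` alongside the compiled TRT-LLM repo contents, while Step 4 of the guide archives only the engine output directory. A quick pre-upload check can catch a missing file before a long endpoint startup does. This is a sketch that assumes `serving.properties` is expected at the archive root; the function name is hypothetical and not part of the repo.

```python
import posixpath
import tarfile
import tempfile
from pathlib import Path


def bundle_has_serving_properties(tar_path) -> bool:
    """Return True if the gzip tarball has serving.properties at its root."""
    with tarfile.open(tar_path, "r:gz") as tar:
        # `tar czf out.tar.gz .` stores members as "./name"; normalize them.
        names = {posixpath.normpath(name) for name in tar.getnames()}
    return "serving.properties" in names


# Self-contained demo: build a tiny bundle the same way Step 4 does (`tar czf ... .`).
with tempfile.TemporaryDirectory() as tmp:
    bundle_dir = Path(tmp) / "bundle"
    bundle_dir.mkdir()
    (bundle_dir / "serving.properties").write_text("engine=Python\n")
    archive = Path(tmp) / "model.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(bundle_dir, arcname=".")
    print(bundle_has_serving_properties(archive))
```

Running the check against `/tmp/mistral-7b-fp16-trtllm.tar.gz` before `aws s3 cp` would show whether the properties file needs to be copied into the engine directory first.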