TRT-LLM compiles a CUDA engine at startup, which takes 20-40 minutes and exceeds SageMaker's health check timeout. The solution: compile once offline, upload to S3, and point the endpoint at the cached engine.
- AWS CLI configured with your credentials
- A GPU instance (EC2 g5.2xlarge or larger) for compilation
- Your S3 bucket: $SAGEMAKER_S3_BUCKET
- HuggingFace token for gated models: $HF_TOKEN
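Before launching anything, it can help to fail fast when one of the variables above is missing. The following is a minimal sketch in POSIX shell; the helper name `require_env` is illustrative and not part of this repository.

```shell
# Return non-zero and report every listed variable that is unset or empty.
# Arguments are variable NAMES, not values.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_env SAGEMAKER_S3_BUCKET HF_TOKEN || exit 1
```

The check only confirms that the variables are non-empty; it does not validate the token or confirm that the bucket exists.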
# Launch a g5.2xlarge (same GPU arch as SageMaker g5 instances)
aws ec2 run-instances \
--image-id ami-0a0e5d9c7acc336f1 \
--instance-type g5.2xlarge \
--key-name your-key \
--security-group-ids sg-xxx \
--subnet-id subnet-xxx \
--block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200}}]' \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=trtllm-compile}]'
SSH into the instance once it's running.
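One way to wait for the instance and find its address is sketched below. It relies on the AWS CLI's `ec2 wait instance-running` and `ec2 describe-instances` commands; the function name `wait_and_get_ip`, the key path, and the `ubuntu` login user are assumptions (the login user depends on the AMI).

```shell
# Block until the instance reaches the "running" state, then print its public IP.
# $1 is the instance ID returned by run-instances (for example i-xxx).
wait_and_get_ip() {
  aws ec2 wait instance-running --instance-ids "$1" || return 1
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' \
    --output text
}

# Usage (key path and login user are assumptions):
#   ip=$(wait_and_get_ip i-xxx)
#   ssh -i ~/.ssh/your-key.pem "ubuntu@$ip"
```

If the subnet does not assign public addresses, the query prints `None`, and you would need to connect through a private IP, bastion host, or Session Manager instead.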
# Get the same container image SageMaker uses
# Replace REGION and ACCOUNT with your DJL image registry
# You can find the exact URI by running locally:
# python -c "from sagemaker import image_uris; print(image_uris.retrieve('djl-tensorrtllm', 'us-east-1', version='0.32.0'))"
IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-tensorrtllm0.16.0-cu128-full"
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
docker pull $IMAGE
docker run --gpus all --rm \
-v /tmp/trtllm-cache:/tmp/trtllm-cache \
-e HF_TOKEN=$HF_TOKEN \
$IMAGE \
python /opt/djl/partition/trt_llm_partition.py \
--properties_dir /dev/null \
--trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
--tensor_parallel_degree 1 \
--pipeline_parallel_degree 1 \
--model_path mistralai/Mistral-7B-Instruct-v0.3 \
--dtype fp16 \
--max_input_len 4096 \
--max_num_tokens 16384
If the above doesn't work with /dev/null as properties_dir, create a properties file:
mkdir -p /tmp/trtllm-props
cat > /tmp/trtllm-props/serving.properties << 'EOF'
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.3
option.rolling_batch=trtllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.max_model_len=4096
option.max_rolling_batch_size=256
option.max_num_tokens=16384
EOF
docker run --gpus all --rm \
-v /tmp/trtllm-cache:/tmp/trtllm-cache \
-v /tmp/trtllm-props:/tmp/trtllm-props \
-e HF_TOKEN=$HF_TOKEN \
$IMAGE \
python /opt/djl/partition/trt_llm_partition.py \
--properties_dir /tmp/trtllm-props \
--trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
--tensor_parallel_degree 1 \
--pipeline_parallel_degree 1 \
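--model_path mistralai/Mistral-7B-Instruct-v0.3
For an AWQ-quantized engine, use the same steps but set option.quantize=awq in serving.properties and write to a separate output directory: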
cat > /tmp/trtllm-props/serving.properties << 'EOF'
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.3
option.rolling_batch=trtllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.quantize=awq
option.max_model_len=4096
option.max_rolling_batch_size=256
option.max_num_tokens=16384
EOF
docker run --gpus all --rm \
-v /tmp/trtllm-cache:/tmp/trtllm-cache \
-v /tmp/trtllm-props:/tmp/trtllm-props \
-e HF_TOKEN=$HF_TOKEN \
$IMAGE \
python /opt/djl/partition/trt_llm_partition.py \
--properties_dir /tmp/trtllm-props \
--trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-awq \
--tensor_parallel_degree 1 \
--pipeline_parallel_degree 1 \
--model_path mistralai/Mistral-7B-Instruct-v0.3
# Package the compiled engine into a model.tar.gz
cd /tmp/trtllm-cache/mistral-7b-fp16
tar czf /tmp/mistral-7b-fp16-trtllm.tar.gz .
# Upload to S3
aws s3 cp /tmp/mistral-7b-fp16-trtllm.tar.gz \
s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-fp16/model.tar.gz
# Repeat for AWQ if compiled
cd /tmp/trtllm-cache/mistral-7b-awq
tar czf /tmp/mistral-7b-awq-trtllm.tar.gz .
aws s3 cp /tmp/mistral-7b-awq-trtllm.tar.gz \
s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-awq/model.tar.gz
Modify configs/trtllm_fp16.yaml to point at the S3 engine:
name: "trtllm-mistral7b-fp16-g5"
model:
quantization: "fp16"
backend: "trtllm"
env_vars:
# Point to the pre-compiled engine in S3 instead of HuggingFace model ID
OPTION_MODEL_ID: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/"
OPTION_ROLLING_BATCH: "trtllm"
OPTION_MAX_MODEL_LEN: "4096"
OPTION_TENSOR_PARALLEL_DEGREE: "1"
OPTION_DTYPE: "fp16"
endpoint:
instance_type: "ml.g5.2xlarge"
instance_cost_per_hour: 1.52
container_startup_timeout: 600 # Pre-compiled engine loads in <5 min
Key change: replace HF_MODEL_ID with OPTION_MODEL_ID, pointing at the S3 path that holds the
pre-compiled engine. The container downloads and loads the engine directly instead
of compiling from scratch.
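Because a typo in the S3 path only shows up after the endpoint fails to start, a quick preflight before deploying can save a cycle. The sketch below checks that the value has the shape of an S3 prefix and that `aws s3 ls` lists something under it; both function names are hypothetical helpers, not part of this repository.

```shell
# Succeed only if the value looks like an S3 prefix: s3://bucket/.../ with a trailing slash.
is_s3_prefix() {
  case "$1" in
    s3://[!/]*/) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed only if the prefix is well-formed and `aws s3 ls` lists at least one entry under it.
engine_prefix_has_objects() {
  is_s3_prefix "$1" || { echo "not an S3 prefix: $1" >&2; return 1; }
  listing=$(aws s3 ls "$1") || return 1
  [ -n "$listing" ]
}

# Usage:
#   engine_prefix_has_objects "s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-fp16/" \
#     || echo "nothing found under the engine prefix" >&2
```

This only confirms that something exists under the prefix. It does not validate the engine files themselves or check that the container can read them.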
# If you use the model.tar.gz approach, also update model_registry.py to pass ModelDataUrl for trtllm.
# Otherwise, use the OPTION_MODEL_ID env var approach above.
conda run -n sagemaker-llm-optimizer python -m src.benchmark.runner \
--config configs/trtllm_fp16.yaml
# Terminate the EC2 compilation instance
aws ec2 terminate-instances --instance-ids i-xxx
# Optionally delete the S3 engines if no longer needed
# aws s3 rm s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/ --recursive
- The compiled engine is GPU-architecture-specific. A g5 (A10G) engine won't work on p4d (A100) or p5 (H100) instances. Compile on the same GPU family you'll deploy to.
- Engine compilation typically takes 20-40 min for a 7B model. Larger models take longer.
- The model.tar.gz approach requires the SageMaker execution role to have s3:GetObject on the engine bucket (already configured if you ran setup_aws.sh).
- DJL 0.32.0 uses TensorRT-LLM 0.16.0. If you upgrade the DJL version, you may need to recompile the engine.
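The GPU-architecture note can be turned into a guard on the compile host. The sketch below maps the instance families named above to their GPUs and compares that against the name reported by `nvidia-smi`; the function names are illustrative, and the mapping covers only the three families mentioned in these notes.

```shell
# Map an instance family to the GPU model named in the notes above.
expected_gpu() {
  case "$1" in
    g5)  echo "A10G" ;;
    p4d) echo "A100" ;;
    p5)  echo "H100" ;;
    *)   return 1 ;;
  esac
}

# Succeed if the reported GPU name contains the model expected for the family.
# $1: instance family you plan to deploy to; $2: GPU name string from the driver.
gpu_matches_family() {
  want=$(expected_gpu "$1") || return 1
  case "$2" in
    *"$want"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the compile host (requires nvidia-smi):
#   gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
#   gpu_matches_family g5 "$gpu_name" || echo "compile GPU does not match deploy target" >&2
```

A substring match is used because the driver reports longer names such as "NVIDIA A10G" rather than the bare model.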