TRT-LLM Pre-compiled Engine Guide

TRT-LLM compiles a TensorRT engine at container startup, which takes 20-40 minutes and exceeds SageMaker's health check timeout. The fix: compile the engine once offline, upload it to S3, and point the endpoint at the pre-compiled engine.

Prerequisites

  • AWS CLI configured with your credentials
  • A GPU instance (EC2 g5.2xlarge or larger) for compilation
  • Your S3 bucket: $SAGEMAKER_S3_BUCKET
  • HuggingFace token for gated models: $HF_TOKEN
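
The later steps fail in confusing ways if either variable is empty, so a quick preflight check can help. This is a minimal sketch; `require_vars` is a hypothetical helper name, not part of the AWS CLI or DJL.

```shell
# Fail early if the variables used throughout this guide are unset or empty.
require_vars() {
  # Run in a subshell so a failed ${VAR:?} check does not kill the caller's shell.
  ( : "${SAGEMAKER_S3_BUCKET:?set your S3 bucket name}" \
      "${HF_TOKEN:?set your HuggingFace token}" )
}

# Usage:
#   require_vars || exit 1
#   aws sts get-caller-identity >/dev/null || echo "AWS credentials not configured" >&2
```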

Step 1 — Launch an EC2 GPU Instance

# Launch a g5.2xlarge (same GPU arch as SageMaker g5 instances)
aws ec2 run-instances \
  --image-id ami-0a0e5d9c7acc336f1 \
  --instance-type g5.2xlarge \
  --key-name your-key \
  --security-group-ids sg-xxx \
  --subnet-id subnet-xxx \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=trtllm-compile}]'

SSH into the instance once it's running.
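
One way to know when the instance is ready is to block on the EC2 waiter and then read back its address. The sketch below assumes the subnet assigns a public IP; `instance_public_ip` is a hypothetical helper, and the instance ID is whatever `run-instances` returned.

```shell
# Wait until the instance is running, then print its public IP address.
instance_public_ip() {
  aws ec2 wait instance-running --instance-ids "$1" &&
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# Usage (the login user depends on the AMI; "ubuntu" is an assumption):
#   ssh -i your-key.pem "ubuntu@$(instance_public_ip i-xxx)"
```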

Step 2 — Pull the DJL TensorRT-LLM Container

# Get the same container image SageMaker uses.
# The account ID and region in the URI below are for us-east-1; adjust them for your region.
# You can find the exact URI by running locally:
#   python -c "from sagemaker import image_uris; print(image_uris.retrieve('djl-tensorrtllm', 'us-east-1', version='0.32.0'))"

IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-tensorrtllm0.16.0-cu128-full"

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
docker pull $IMAGE
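
The registry host and region appear twice in the commands above. If you change regions, deriving both from `IMAGE` keeps them in sync. A small sketch, assuming the URI follows the usual `ACCOUNT.dkr.ecr.REGION.amazonaws.com/...` ECR layout:

```shell
IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-tensorrtllm0.16.0-cu128-full"

# Everything before the first "/" is the registry host.
REGISTRY="${IMAGE%%/*}"
# The region is the 4th dot-separated field of the ECR host name.
REGION="$(printf '%s\n' "$REGISTRY" | cut -d. -f4)"

echo "registry=$REGISTRY region=$REGION"

# Usage:
#   aws ecr get-login-password --region "$REGION" | docker login --username AWS --password-stdin "$REGISTRY"
#   docker pull "$IMAGE"
```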

Step 3 — Compile the TRT-LLM Engine

FP16 Engine

docker run --gpus all --rm \
  -v /tmp/trtllm-cache:/tmp/trtllm-cache \
  -e HF_TOKEN=$HF_TOKEN \
  $IMAGE \
  python /opt/djl/partition/trt_llm_partition.py \
    --properties_dir /dev/null \
    --trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --model_path mistralai/Mistral-7B-Instruct-v0.3 \
    --dtype fp16 \
    --max_input_len 4096 \
    --max_num_tokens 16384

If the partition script rejects /dev/null as --properties_dir, create a serving.properties file and pass its directory instead:

mkdir -p /tmp/trtllm-props
cat > /tmp/trtllm-props/serving.properties << 'EOF'
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.3
option.rolling_batch=trtllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.max_model_len=4096
option.max_rolling_batch_size=256
option.max_num_tokens=16384
EOF

docker run --gpus all --rm \
  -v /tmp/trtllm-cache:/tmp/trtllm-cache \
  -v /tmp/trtllm-props:/tmp/trtllm-props \
  -e HF_TOKEN=$HF_TOKEN \
  $IMAGE \
  python /opt/djl/partition/trt_llm_partition.py \
    --properties_dir /tmp/trtllm-props \
    --trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-fp16 \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --model_path mistralai/Mistral-7B-Instruct-v0.3

AWQ-INT4 Engine

Same as above, but add option.quantize=awq to serving.properties and write to a separate output directory:

cat > /tmp/trtllm-props/serving.properties << 'EOF'
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.3
option.rolling_batch=trtllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.quantize=awq
option.max_model_len=4096
option.max_rolling_batch_size=256
option.max_num_tokens=16384
EOF

docker run --gpus all --rm \
  -v /tmp/trtllm-cache:/tmp/trtllm-cache \
  -v /tmp/trtllm-props:/tmp/trtllm-props \
  -e HF_TOKEN=$HF_TOKEN \
  $IMAGE \
  python /opt/djl/partition/trt_llm_partition.py \
    --properties_dir /tmp/trtllm-props \
    --trt_llm_model_repo /tmp/trtllm-cache/mistral-7b-awq \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --model_path mistralai/Mistral-7B-Instruct-v0.3
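
Before packaging, it is worth confirming that compilation actually wrote something. The sketch below only checks that the output directory holds at least one non-empty file; it makes no assumption about the exact file layout the partition script produces. `engine_dir_ok` is a hypothetical helper.

```shell
# Succeed only if the directory exists and contains at least one non-empty file.
engine_dir_ok() {
  [ -d "$1" ] && [ -n "$(find "$1" -type f -size +0c | head -n 1)" ]
}

# Usage:
#   engine_dir_ok /tmp/trtllm-cache/mistral-7b-fp16 && du -sh /tmp/trtllm-cache/mistral-7b-fp16
#   engine_dir_ok /tmp/trtllm-cache/mistral-7b-awq  || echo "AWQ engine missing" >&2
```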

Step 4 — Package and Upload to S3

# Package the compiled engine into a model.tar.gz
cd /tmp/trtllm-cache/mistral-7b-fp16
tar czf /tmp/mistral-7b-fp16-trtllm.tar.gz .

# Upload to S3
aws s3 cp /tmp/mistral-7b-fp16-trtllm.tar.gz \
  s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-fp16/model.tar.gz

# Repeat for AWQ if compiled
cd /tmp/trtllm-cache/mistral-7b-awq
tar czf /tmp/mistral-7b-awq-trtllm.tar.gz .
aws s3 cp /tmp/mistral-7b-awq-trtllm.tar.gz \
  s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-awq/model.tar.gz
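
The packaging commands for the two engines differ only in the directory name, so they can be folded into one function that also verifies the archive. A minimal sketch; `package_engine` is a hypothetical helper name.

```shell
# Pack a compiled engine directory into a tarball, then print how many
# files (not directories) the archive contains. Fails if the archive is empty.
package_engine() {
  # -C changes into the source directory so paths inside the archive are relative.
  tar czf "$2" -C "$1" . &&
  tar tzf "$2" | grep -vc '/$'
}

# Usage:
#   for v in fp16 awq; do
#     package_engine "/tmp/trtllm-cache/mistral-7b-$v" "/tmp/mistral-7b-$v-trtllm.tar.gz" &&
#     aws s3 cp "/tmp/mistral-7b-$v-trtllm.tar.gz" \
#       "s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-$v/model.tar.gz"
#   done
```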

Step 5 — Update Config to Use Pre-compiled Engine

Modify configs/trtllm_fp16.yaml to point at the S3 engine:

name: "trtllm-mistral7b-fp16-g5"

model:
  quantization: "fp16"
  backend: "trtllm"
  env_vars:
    # Point to the pre-compiled engine in S3 instead of HuggingFace model ID
    OPTION_MODEL_ID: "s3://<your-bucket>/trtllm-engines/mistral-7b-fp16/"
    OPTION_ROLLING_BATCH: "trtllm"
    OPTION_MAX_MODEL_LEN: "4096"
    OPTION_TENSOR_PARALLEL_DEGREE: "1"
    OPTION_DTYPE: "fp16"

endpoint:
  instance_type: "ml.g5.2xlarge"
  instance_cost_per_hour: 1.52
  container_startup_timeout: 600  # Pre-compiled engine loads in <5 min

Key change: HF_MODEL_ID is replaced by OPTION_MODEL_ID, which points to the S3 path holding the pre-compiled engine. The container downloads and loads the engine directly instead of compiling from scratch.
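
A typo in the S3 path only surfaces when the endpoint fails to start, so it can be cheaper to check the prefix first. The sketch below assumes the AWS CLI is configured with read access to the bucket; `s3_prefix_has_objects` is a hypothetical helper.

```shell
# Succeed if "aws s3 ls" prints anything for the given prefix.
s3_prefix_has_objects() {
  # Empty output (or an error) means nothing is readable at that prefix.
  [ -n "$(aws s3 ls "$1" 2>/dev/null)" ]
}

# Usage:
#   s3_prefix_has_objects "s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/mistral-7b-fp16/" \
#     || echo "no engine found at that prefix" >&2
```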

Step 6 — Run the Benchmark

# If you deploy via model.tar.gz, model_registry.py must also pass ModelDataUrl for trtllm.
# Alternatively, use the OPTION_MODEL_ID env var approach from Step 5.

conda run -n sagemaker-llm-optimizer python -m src.benchmark.runner \
  --config configs/trtllm_fp16.yaml

Step 7 — Cleanup

# Terminate the EC2 compilation instance
aws ec2 terminate-instances --instance-ids i-xxx

# Optionally delete the S3 engines if no longer needed
# aws s3 rm s3://$SAGEMAKER_S3_BUCKET/trtllm-engines/ --recursive
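
If you no longer have the instance ID at hand, the Name tag set in Step 1 can be used to look it up. A sketch, assuming the tag value `trtllm-compile` was kept as-is; `compile_instance_ids` is a hypothetical helper.

```shell
# Print the IDs of all non-terminated instances tagged Name=trtllm-compile.
compile_instance_ids() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=trtllm-compile" \
              "Name=instance-state-name,Values=pending,running,stopping,stopped" \
    --query 'Reservations[].Instances[].InstanceId' --output text
}

# Usage (left unquoted on purpose so several IDs split into separate arguments):
#   aws ec2 terminate-instances --instance-ids $(compile_instance_ids)
```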

Notes

  • The compiled engine is GPU-architecture-specific. A g5 (A10G) engine won't work on p4d (A100) or p5 (H100) instances. Compile on the same GPU family you'll deploy to.
  • Engine compilation typically takes 20-40 min for a 7B model. Larger models take longer.
  • The model.tar.gz approach requires the SageMaker execution role to have s3:GetObject on the engine bucket (already configured if you ran setup_aws.sh).
  • DJL 0.32.0 uses TensorRT-LLM 0.16.0. Engines are tied to the TensorRT-LLM version that built them, so plan to recompile if you upgrade the DJL version.
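
The GPU-family rule in the first note can be turned into a small guard that compares the build instance with the deployment instance. This sketch covers only the three families named above; `gpu_for_instance` and `same_gpu` are hypothetical helpers.

```shell
# Map an EC2 or SageMaker instance type to the GPU model it carries.
gpu_for_instance() {
  local family
  family="${1#ml.}"        # drop SageMaker's "ml." prefix if present
  family="${family%%.*}"   # keep the family, e.g. "g5" from "g5.2xlarge"
  case "$family" in
    g5)  echo "A10G" ;;
    p4d) echo "A100" ;;
    p5)  echo "H100" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

# Succeed only if both instance types are known and share the same GPU.
same_gpu() {
  local build deploy
  build="$(gpu_for_instance "$1")" || return 1
  deploy="$(gpu_for_instance "$2")" || return 1
  [ "$build" = "$deploy" ]
}

# Usage:
#   same_gpu g5.2xlarge ml.g5.2xlarge || echo "engine will not load on this instance type" >&2
```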