Systematic evaluation of seven TensorRT-LLM quantization configurations for deploying a 2B-parameter vision-language model on an NVIDIA RTX 5060 Ti (Blackwell), covering inference latency, VRAM footprint, and multimodal accuracy — with a focus on failure-mode diagnosis and actionable deployment guidance.
The most surprising result: The standard FP8 deployment recipe for text LLMs causes a silent, severe accuracy regression on VLMs (−8.8 pp VQAv2, −189 MME) that is fully recoverable by a one-flag change. The root cause — FP8's limited dynamic range saturating on visual token KV representations — is a configuration pitfall not documented in TRT-LLM's VLM guides and unlikely to surface in text-only evaluations.
- SmoothQuant (W8A8) dominates the accuracy–speed Pareto frontier. It achieves a 2.00× TTFT speedup and a 2.13× decode speedup while slightly improving on baseline accuracy (VQAv2 +0.5 pp, MME +21). W8A8 fused attention kernels reduce prefill latency beyond what weight-only methods achieve; outlier migration to weights reduces activation quantization noise.
- --kv_cache_dtype fp8 must not be used with VLMs. Visual token KV embeddings produced by the BF16 ViT encoder exceed FP8 E4M3's dynamic range (±448, versus roughly ±3.4 × 10^38 for BF16), causing compounding attention errors across layers. Two ablations isolate the cause: removing FP8 FMHA has no effect; removing FP8 KV cache fully recovers accuracy. The most affected tasks are fine-grained visual matching: landmark (+10.5 pp recovered), posters (+13.6 pp), celebrity (+8.5 pp).
- Decode speedup tracks weight bitwidth predictably; TTFT improvement does not. The 4-bit tiers (INT4, INT4-AWQ, NVFP4) achieve 2.39–2.42× decode speedup, consistent with the theoretical 2× bandwidth reduction from halving bitwidth, confirming memory-bandwidth-bound decode. TTFT speedups fall in a comparatively narrow band across TRT tiers (1.63–2.12×) and do not follow weight bitwidth, because prefill is compute-bound and benefits from graph fusion regardless of weight precision.
- INT4-AWQ causes catastrophic failure on text translation (−27.5 pp) while plain INT4 scores normally. This isolates the failure to AWQ's calibration step, not 4-bit quantization per se, consistent with a multilingual gap in the calibration corpus.
- NVFP4 (W4A8) degrades reasoning tasks disproportionately while preserving visual perception. commonsense_reasoning drops 13.5 pp and code_reasoning drops 12.5 pp, whereas existence, position, and color degrade by only 1–3 pp. This suggests that W4A8 accumulation errors disrupt multi-hop consistency more than single-step visual recognition.

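
The FP8 KV cache finding above can be made concrete with a small simulation. The following is a minimal sketch, not the TRT-LLM implementation: `fake_quant_e4m3` is a hypothetical helper that mimics a scaled, saturating E4M3 round trip, and the tensor values (including the visual outlier of 1800) are invented for illustration. It assumes a per-tensor scale calibrated on text-like magnitudes only.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def fake_quant_e4m3(x: float, scale: float = 1.0) -> float:
    """Simulate a scaled FP8 E4M3 round trip: saturate, then keep 3 mantissa bits."""
    v = max(-E4M3_MAX, min(E4M3_MAX, x / scale))  # saturating cast
    if v == 0.0:
        return 0.0
    _, e = math.frexp(abs(v))        # abs(v) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)     # spacing between neighbours (3 mantissa bits)
    return round(v / step) * step * scale


# Hypothetical magnitudes: text-token K/V entries stay small, one visual token spikes.
text_like = [0.3, -4.0, 12.5, -20.0]
visual_outlier = 1800.0

# Assumption: the per-tensor scale maps the largest text-only magnitude to 448.
scale = max(abs(t) for t in text_like) / E4M3_MAX

for value in text_like + [visual_outlier]:
    print(f"{value:>8.1f} -> {fake_quant_e4m3(value, scale):>8.3f}")
```

Under these assumptions the text-like values survive with small rounding error, while the outlier saturates to roughly the calibrated maximum (about 20). That is the kind of error that would then propagate through attention in every layer.
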
All results on NVIDIA RTX 5060 Ti (16 GB, Blackwell sm_120). FP8 results use the corrected configuration (--no_kv_fp8). Full per-task MME breakdown and FP8 ablation table in results/reports/final_report.md.
| Configuration | VQAv2 (%) | POPE F1 (%) | MME Total | TTFT Speedup | Decode Speedup | Static VRAM (GB) | Dyn VRAM (GB) |
|---|---|---|---|---|---|---|---|
| PyTorch BF16 (baseline) | 82.3 | 87.4 | 1952 | 1.00× | 1.00× | 4.12 | 0.17 |
| TRT BF16 | 82.3 | 88.0 | 1972 | 1.63× | 1.47× | 6.13 | 0.81 |
| TRT INT8 (W8A16) | 82.0 | 87.5 | 1959 | 1.64× | 2.05× | 5.06 | 0.79 |
| TRT INT4 (W4A16) | 79.1 | 87.0 | 1921 | 1.68× | 2.39× | 4.51 | 0.79 |
| TRT SmoothQuant (W8A8) | 82.8 | 87.4 | 1973 | 2.00× | 2.13× | 5.06 | 0.79 |
| TRT FP8 (W8A8, BF16 KV) | 82.0 | 87.8 | 1946 | 1.86× | 2.03× | 4.96 | 1.40 |
| TRT INT4-AWQ (W4A16) | 81.3 | 89.5 | 1862 | 1.64× | 2.41× | 4.52 | 0.79 |
| TRT NVFP4 (W4A8) | 81.6 | 86.4 | 1938 | 2.12× | 2.42× | 4.19 | 0.73 |
Deployment recommendations:
| Priority | Tier | Decode Speedup | VQAv2 | Notes |
|---|---|---|---|---|
| Best accuracy–speed balance | TRT SmoothQuant | 2.13× | 82.8% | Lossless; best TTFT among lossless tiers (2.00×) |
| Fastest + smallest footprint | TRT NVFP4 | 2.42× | 81.6% | Blackwell only; avoid reasoning-critical tasks |
| Safe default, no calibration | TRT INT8 | 2.05× | 82.0% | No calibration data needed |
| FP8 with lossless accuracy | TRT FP8 (BF16 KV) | 2.03× | 82.0% | Must use --no_kv_fp8; higher dyn VRAM (1.4 GB) |
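
The Pareto claim can be checked directly against the results table. This sketch uses only two of the reported metrics (VQAv2 accuracy and decode speedup), copied from the table above; the `Tier`, `dominates`, and `pareto_front` names are illustrative and not part of the repository. Including POPE, MME, or VRAM as extra objectives could change which tiers are non-dominated.

```python
from typing import List, NamedTuple


class Tier(NamedTuple):
    name: str
    vqav2: float           # VQAv2 accuracy (%)
    decode_speedup: float  # relative to the PyTorch BF16 baseline


TIERS = [
    Tier("PyTorch BF16 (baseline)", 82.3, 1.00),
    Tier("TRT BF16", 82.3, 1.47),
    Tier("TRT INT8 (W8A16)", 82.0, 2.05),
    Tier("TRT INT4 (W4A16)", 79.1, 2.39),
    Tier("TRT SmoothQuant (W8A8)", 82.8, 2.13),
    Tier("TRT FP8 (W8A8, BF16 KV)", 82.0, 2.03),
    Tier("TRT INT4-AWQ (W4A16)", 81.3, 2.41),
    Tier("TRT NVFP4 (W4A8)", 81.6, 2.42),
]


def dominates(a: Tier, b: Tier) -> bool:
    """a dominates b if it is at least as good on both axes and better on one."""
    no_worse = a.vqav2 >= b.vqav2 and a.decode_speedup >= b.decode_speedup
    better = a.vqav2 > b.vqav2 or a.decode_speedup > b.decode_speedup
    return no_worse and better


def pareto_front(tiers: List[Tier]) -> List[str]:
    """Names of tiers that no other tier dominates, in table order."""
    return [t.name for t in tiers if not any(dominates(o, t) for o in tiers)]


print(pareto_front(TIERS))
```

On these two axes only SmoothQuant and NVFP4 remain non-dominated, which matches the first two rows of the recommendation table.
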
Model. Qwen2-VL-2B-Instruct (Alibaba, 2024): 2B-parameter VLM with a ViT image encoder and a Qwen2 LLM decoder. The vision encoder is always built in BF16; only the LLM decoder is quantized.
Quantization configs.
| Method | W×A | Calibration | Build Pipeline |
|---|---|---|---|
| BF16 | W16A16 | — | Pipeline A (convert_checkpoint.py) |
| INT8 | W8A16 | — | Pipeline A |
| INT4 | W4A16 | — | Pipeline A |
| SmoothQuant | W8A8 | ✓ | Pipeline A |
| FP8 | W8A8 | ✓ | Pipeline B (NVIDIA ModelOpt) |
| INT4-AWQ | W4A16 | ✓ | Pipeline B |
| NVFP4 | W4A8 | ✓ | Pipeline B (Blackwell only) |
Efficiency benchmarks. 50 LLaVA-Bench samples. TTFT measured by running max_new_tokens=1 first; decode latency = (total_latency − TTFT) / output_tokens. VRAM measured via nvidia-smi delta (TRT-LLM allocates outside PyTorch's allocator).
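
The latency and VRAM definitions above can be sketched as follows. This is an illustration of the stated formulas, not the code in benchmark/efficiency/metrics.py: `generate` is a hypothetical callable that runs one request with the given `max_new_tokens` and returns the number of tokens produced, and the `nvidia-smi` query is one common way to read device-wide memory use.

```python
import subprocess
import time
from typing import Callable, Dict


def decode_latency_per_token(total_latency_s: float, ttft_s: float,
                             output_tokens: int) -> float:
    """decode latency = (total_latency - TTFT) / output_tokens."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return (total_latency_s - ttft_s) / output_tokens


def time_request(generate: Callable[[int], int], max_new_tokens: int) -> Dict[str, float]:
    """Measure TTFT with a one-token run, then time the full generation."""
    start = time.perf_counter()
    generate(1)                              # max_new_tokens=1 approximates TTFT
    ttft = time.perf_counter() - start

    start = time.perf_counter()
    produced = generate(max_new_tokens)      # full request
    total = time.perf_counter() - start

    return {
        "ttft_s": ttft,
        "decode_s_per_token": decode_latency_per_token(total, ttft, produced),
    }


def gpu_memory_used_mib(gpu_index: int = 0) -> int:
    """Device-wide memory in use (MiB) as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip().splitlines()[0])
```

A VRAM delta is then the difference between two `gpu_memory_used_mib()` readings, for example before and after loading an engine. Because the reading is device-wide, other processes on the same GPU would distort it.
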
Accuracy benchmarks. VQAv2 (500 validation samples, soft-match against 10 human annotations); POPE (500 × 3 splits, F1); MME (full ~2.8K samples, 10 perception + 4 cognition tasks).
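
For readers unfamiliar with the two scoring rules, here is a minimal sketch. It assumes the commonly used simplified VQA accuracy, min(1, matches / 3), and a POPE F1 that treats "yes" as the positive class; the answer normalisation here is only lowercasing and whitespace stripping, which is simpler than the official VQA evaluation. The function names are illustrative and may not match benchmark/accuracy/accuracy_metrics.py.

```python
from typing import Sequence


def vqa_soft_score(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: full credit once 3 annotators agree with the answer."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(1.0, matches / 3.0)


def binary_f1(predictions: Sequence[str], labels: Sequence[str],
              positive: str = "yes") -> float:
    """F1 over yes/no answers with `positive` as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two of ten annotators said "blue": partial credit.
print(vqa_soft_score("Blue", ["blue"] * 2 + ["navy"] * 8))
print(binary_f1(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"]))
```

The soft score is averaged over questions to give the reported VQAv2 percentage.
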
Hardware. NVIDIA RTX 5060 Ti, 16 GB VRAM, Blackwell sm_120, CUDA 12.8.
- NVIDIA GPU, CUDA 12.8+
- 16 GB VRAM recommended (Stage 3 vision encoder build peaks at ~20 GB system RAM)
- FP8 requires Ada/Hopper/Blackwell; NVFP4 requires Blackwell
- Docker with --gpus all support

Download the model weights:

    pip install huggingface_hub
    huggingface-cli download Qwen/Qwen2-VL-2B-Instruct \
        --local-dir ./models/Qwen2-VL-2B-Instruct

Build the Docker image:

    docker build -t vlm-bench:latest .

Based on nvcr.io/nvidia/tensorrt-llm/release:0.21.0. Build takes ~10–15 minutes.

Launch the container:

    docker run -it --gpus all \
        --ipc=host \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        -v $(pwd):/workspace \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        vlm-bench:latest bash

Build the TensorRT engines:

    # Pipeline A: no calibration required
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant bf16
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int8
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int4
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant smoothquant

    # Pipeline B: calibration required (NVIDIA ModelOpt)
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant fp8 --no_kv_fp8
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int4_awq
    bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant nvfp4

FP8 note: --no_kv_fp8 is required for correct accuracy on Qwen2-VL. Using --kv_cache_dtype fp8 (the default text-LLM recipe) causes VQAv2 to drop by 8.8 pp because visual token KV distributions exceed FP8's dynamic range. See results/reports/final_report.md §5.1 for the two-ablation diagnosis.

See scripts/README.md for per-stage build details and the OOM warning for Stage 3.

Run the benchmarks:

    # Efficiency: TTFT, decode latency, VRAM (50 LLaVA-Bench samples)
    cd /workspace/benchmark/efficiency && bash run_efficiency_all.sh

    # Accuracy: VQAv2 (500), POPE (1500), MME (~2.8K)
    cd /workspace/benchmark/accuracy && bash run_accuracy_all.sh

Generate the report:

    cd /workspace/benchmark
    python3 report.py --latest
    # Output: results/reports/report_<timestamp>/report.md + img/*.png

Repository layout:

    ├── Dockerfile
    ├── benchmark/
    │   ├── efficiency/                  # TTFT, decode latency, VRAM benchmark
    │   │   ├── config.py
    │   │   ├── data_loader.py
    │   │   ├── metrics.py
    │   │   ├── run_benchmark_baseline.py
    │   │   ├── run_benchmark_trt.py
    │   │   └── run_efficiency_all.sh
    │   ├── accuracy/                    # VQAv2 / POPE / MME benchmark
    │   │   ├── config_accuracy.py
    │   │   ├── data_loader_acc.py
    │   │   ├── accuracy_metrics.py
    │   │   ├── run_accuracy_baseline.py
    │   │   ├── run_accuracy_trt.py
    │   │   └── run_accuracy_all.sh
    │   └── report.py                    # chart + Markdown report generator
    ├── scripts/
    │   ├── build_trt_engines.sh         # TRT engine builder for all quant modes
    │   └── README.md
    ├── models/                          # HF model weights (gitignored; see Step 1)
    └── results/
        └── reports/                     # tracked in git
            ├── final_report.md          # full written paper
            ├── final_report.pdf
            └── img/                     # figures (speed, vram, accuracy, mme, tradeoff, fp8_ablation)

- Single GPU, single run. Results are on one RTX 5060 Ti; variance estimates are across-sample, not across-run.
- Batch size 1 only. Throughput under batched inference would favour compute-bound tiers more strongly.
- INT4-AWQ calibration corpus not inspected directly. The translation anomaly is attributed to a multilingual gap but the corpus was not analysed.
- Next steps: batch inference profiling; multilingual AWQ calibration; extending the FP8 KV cache ablation to other VLM architectures.
Completed as part of UW Advanced ML coursework (CSE 599S, University of Washington).