Skip to content

Kevinma0215/Qwen2-VL-TRT-Quantization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Swift-VLM-Flow: Quantization Benchmarking of Qwen2-VL-2B on Edge GPUs

Systematic evaluation of seven TensorRT-LLM quantization configurations for deploying a 2B-parameter vision-language model on an NVIDIA RTX 5060 Ti (Blackwell), covering inference latency, VRAM footprint, and multimodal accuracy — with a focus on failure-mode diagnosis and actionable deployment guidance.


Key Findings

The most surprising result: The standard FP8 deployment recipe for text LLMs causes a silent, severe accuracy regression on VLMs (−8.8 pp VQAv2, −189 MME) that is fully recoverable by a one-flag change. The root cause — FP8's limited dynamic range saturating on visual token KV representations — is a configuration pitfall not documented in TRT-LLM's VLM guides and unlikely to surface in text-only evaluations.

  1. SmoothQuant (W8A8) dominates the accuracy–speed Pareto frontier. It achieves 2.00× TTFT and 2.13× decode speedup while improving over baseline accuracy (VQAv2 +0.5 pp, MME +21). W8A8 fused attention kernels reduce prefill latency beyond what weight-only methods achieve; outlier migration to weights reduces activation quantization noise.

  2. --kv_cache_dtype fp8 must not be used with VLMs. Visual token KV embeddings produced by the BF16 ViT encoder exceed FP8 E4M3's dynamic range (±448 vs BF16's ±65504), causing compounding attention errors across layers. Two ablations isolate the cause: removing FP8 FMHA has no effect; removing FP8 KV cache fully recovers accuracy. The most affected tasks are fine-grained visual matching: landmark (+10.5 pp recovered), posters (+13.6 pp), celebrity (+8.5 pp).

  3. Decode speedup tracks weight bitwidth predictably; TTFT improvement is bitwidth-independent. 4-bit tiers (INT4, INT4-AWQ, NVFP4) achieve 2.39–2.42× decode speedup, consistent with the theoretical 2× bandwidth reduction from halving bitwidth, confirming memory-bandwidth-bound decode. TTFT speedups are uniform across TRT tiers (1.63–2.12×) because prefill is compute-bound and benefits from graph fusion regardless of precision.

  4. INT4-AWQ causes catastrophic failure on text translation (−27.5 pp) while plain INT4 scores normally. This isolates the failure to AWQ's calibration step, not 4-bit quantization per se, consistent with a multilingual gap in the calibration corpus.

  5. NVFP4 (W4A8) degrades reasoning tasks disproportionately while preserving visual perception. commonsense_reasoning −13.5 pp, code_reasoning −12.5 pp; existence, position, and color degrade only 1–3 pp. Suggests W4A8 accumulation errors disrupt multi-hop consistency more than single-step visual recognition.


Benchmark Table

All results on NVIDIA RTX 5060 Ti (16 GB, Blackwell sm_120). FP8 results use the corrected configuration (--no_kv_fp8). Full per-task MME breakdown and FP8 ablation table in results/reports/final_report.md.

Configuration VQAv2 (%) POPE F1 (%) MME Total TTFT Speedup Decode Speedup Static VRAM (GB) Dyn VRAM (GB)
PyTorch BF16 (baseline) 82.3 87.4 1952 1.00× 1.00× 4.12 0.17
TRT BF16 82.3 88.0 1972 1.63× 1.47× 6.13 0.81
TRT INT8 (W8A16) 82.0 87.5 1959 1.64× 2.05× 5.06 0.79
TRT INT4 (W4A16) 79.1 87.0 1921 1.68× 2.39× 4.51 0.79
TRT SmoothQuant (W8A8) 82.8 87.4 1973 2.00× 2.13× 5.06 0.79
TRT FP8 (W8A8, BF16 KV) 82.0 87.8 1946 1.86× 2.03× 4.96 1.40
TRT INT4-AWQ (W4A16) 81.3 89.5 1862 1.64× 2.41× 4.52 0.79
TRT NVFP4 (W4A8) 81.6 86.4 1938 2.12× 2.42× 4.19 0.73

Deployment recommendations:

Priority Tier Decode Speedup VQAv2 Notes
Best accuracy–speed balance TRT SmoothQuant 2.13× 82.8% Lossless; best TTFT
Fastest + smallest footprint TRT NVFP4 2.42× 81.6% Blackwell only; avoid reasoning-critical tasks
Safe default, no calibration TRT INT8 2.05× 82.0% No calibration data needed
FP8 with lossless accuracy TRT FP8 (BF16 KV) 2.03× 82.0% Must use --no_kv_fp8; higher dyn VRAM (1.4 GB)

Methods

Model. Qwen2-VL-2B-Instruct (Alibaba, 2024): 2B-parameter VLM with a ViT image encoder and a Qwen2 LLM decoder. The vision encoder is always built in BF16; only the LLM decoder is quantized.

Quantization configs.

Method W×A Calibration Build Pipeline
BF16 W16A16 Pipeline A (convert_checkpoint.py)
INT8 W8A16 Pipeline A
INT4 W4A16 Pipeline A
SmoothQuant W8A8 Pipeline A
FP8 W8A8 Pipeline B (NVIDIA ModelOpt)
INT4-AWQ W4A16 Pipeline B
NVFP4 W4A8 Pipeline B (Blackwell only)

Efficiency benchmarks. 50 LLaVA-Bench samples. TTFT measured by running max_new_tokens=1 first; decode latency = (total_latency − TTFT) / output_tokens. VRAM measured via nvidia-smi delta (TRT-LLM allocates outside PyTorch's allocator).

Accuracy benchmarks. VQAv2 (500 validation samples, soft-match against 10 human annotations); POPE (500 × 3 splits, F1); MME (full ~2.8K samples, 14 perception + 4 cognition tasks).

Hardware. NVIDIA RTX 5060 Ti, 16 GB VRAM, Blackwell sm_120, CUDA 12.8.


Setup & Reproduction

Requirements

  • NVIDIA GPU, CUDA 12.8+
  • 16 GB VRAM recommended (Stage 3 vision encoder build peaks at ~20 GB system RAM)
  • FP8 requires Ada/Hopper/Blackwell; NVFP4 requires Blackwell
  • Docker with --gpus all support

Step 1 — Download model weights

pip install huggingface_hub
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct \
  --local-dir ./models/Qwen2-VL-2B-Instruct

Step 2 — Build Docker image

docker build -t vlm-bench:latest .

Based on nvcr.io/nvidia/tensorrt-llm/release:0.21.0. Build takes ~10–15 minutes.

Step 3 — Launch container

docker run -it --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v $(pwd):/workspace \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vlm-bench:latest bash

Step 4 — Build TRT engines (inside container)

# Pipeline A: no calibration required
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant bf16
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int8
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int4
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant smoothquant

# Pipeline B: calibration required (NVIDIA ModelOpt)
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant fp8 --no_kv_fp8
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int4_awq
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant nvfp4

FP8 note: --no_kv_fp8 is required for correct accuracy on Qwen2-VL. Using --kv_cache_dtype fp8 (the default text-LLM recipe) causes VQAv2 to drop −8.8 pp because visual token KV distributions exceed FP8's dynamic range. See results/reports/final_report.md §5.1 for the two-ablation diagnosis.

See scripts/README.md for per-stage build details and the OOM warning for Stage 3.

Step 5 — Run benchmarks

# Efficiency: TTFT, decode latency, VRAM (50 LLaVA-Bench samples)
cd /workspace/benchmark/efficiency && bash run_efficiency_all.sh

# Accuracy: VQAv2 (500), POPE (1500), MME (~2.8K)
cd /workspace/benchmark/accuracy && bash run_accuracy_all.sh

Step 6 — Generate report

cd /workspace/benchmark
python3 report.py --latest
# Output: results/reports/report_<timestamp>/report.md + img/*.png

Repository Layout

.
├── Dockerfile
├── benchmark/
│   ├── efficiency/            # TTFT, decode latency, VRAM benchmark
│   │   ├── config.py
│   │   ├── data_loader.py
│   │   ├── metrics.py
│   │   ├── run_benchmark_baseline.py
│   │   ├── run_benchmark_trt.py
│   │   └── run_efficiency_all.sh
│   ├── accuracy/              # VQAv2 / POPE / MME benchmark
│   │   ├── config_accuracy.py
│   │   ├── data_loader_acc.py
│   │   ├── accuracy_metrics.py
│   │   ├── run_accuracy_baseline.py
│   │   ├── run_accuracy_trt.py
│   │   └── run_accuracy_all.sh
│   └── report.py              # chart + Markdown report generator
├── scripts/
│   ├── build_trt_engines.sh   # TRT engine builder for all quant modes
│   └── README.md
├── models/                    # HF model weights (gitignored — see Step 1)
└── results/
    └── reports/               # tracked in git
        ├── final_report.md    # full written paper
        ├── final_report.pdf
        └── img/               # figures (speed, vram, accuracy, mme, tradeoff, fp8_ablation)

Limitations

  • Single GPU, single run. Results are on one RTX 5060 Ti; variance estimates are across-sample, not across-run.
  • Batch size 1 only. Throughput under batched inference would favour compute-bound tiers more strongly.
  • INT4-AWQ calibration corpus not inspected directly. The translation anomaly is attributed to a multilingual gap but the corpus was not analysed.
  • Next steps: batch inference profiling; multilingual AWQ calibration; extending the FP8 KV cache ablation to other VLM architectures.

Completed as part of UW Advanced ML coursework (CSE 599S, University of Washington).

About

Failure-mode diagnosis of TensorRT-LLM quantization on Qwen2-VL-2B. FP8 KV cache causes silent VLM regression; SmoothQuant is Pareto-optimal.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors