Swift-VLM-Flow: Quantization Benchmarking of Qwen2-VL-2B on Edge GPUs

Systematic evaluation of seven TensorRT-LLM quantization configurations for deploying a 2B-parameter vision-language model on an NVIDIA RTX 5060 Ti (Blackwell), covering inference latency, VRAM footprint, and multimodal accuracy — with a focus on failure-mode diagnosis and actionable deployment guidance.

Key Findings

The most surprising result: The standard FP8 deployment recipe for text LLMs causes a silent, severe accuracy regression on VLMs (−8.8 pp VQAv2, −189 MME) that is fully recoverable by a one-flag change. The root cause — FP8's limited dynamic range saturating on visual token KV representations — is a configuration pitfall not documented in TRT-LLM's VLM guides and unlikely to surface in text-only evaluations.

SmoothQuant (W8A8) gives the best accuracy–speed trade-off. It achieves 2.00× TTFT and 2.13× decode speedup while matching or slightly exceeding baseline accuracy (VQAv2 +0.5 pp, MME +21); only the 4-bit tiers decode faster, and they do so with lower VQAv2 and MME scores. W8A8 fused attention kernels reduce prefill latency beyond what weight-only methods achieve, and migrating activation outliers into the weights reduces activation quantization noise.
--kv_cache_dtype fp8 must not be used with VLMs. Visual token KV embeddings produced by the BF16 ViT encoder exceed FP8 E4M3's dynamic range (±448, versus roughly ±3.4 × 10^38 for BF16), causing compounding attention errors across layers. Two ablations isolate the cause: removing FP8 FMHA has no effect; removing FP8 KV cache fully recovers accuracy. The most affected tasks are fine-grained visual matching: landmark (+10.5 pp recovered), posters (+13.6 pp), celebrity (+8.5 pp).
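Decode speedup tracks weight bitwidth; TTFT improvement does not. Decode speedup rises monotonically as weights shrink: 1.47× for TRT BF16, 2.05× for INT8, and 2.39–2.42× for the 4-bit tiers (INT4, INT4-AWQ, NVFP4). This ordering is what memory-bandwidth-bound decode predicts, although the gains are smaller than the 2× and 4× reductions in weight traffic would suggest, presumably because other per-token costs remain. TTFT speedups range from 1.63× to 2.12× across TRT tiers and do not follow weight bitwidth: weight-only tiers cluster at 1.63–1.68×, while tiers with 8-bit activations reach 1.86–2.12×, consistent with compute-bound prefill that benefits from graph fusion and low-precision matrix math rather than from smaller weights.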
INT4-AWQ causes catastrophic failure on text translation (−27.5 pp) while plain INT4 scores normally. This isolates the failure to AWQ's calibration step, not 4-bit quantization per se, consistent with a multilingual gap in the calibration corpus.
NVFP4 (W4A8) degrades reasoning tasks disproportionately while preserving visual perception. commonsense_reasoning drops 13.5 pp and code_reasoning 12.5 pp, whereas existence, position, and color degrade by only 1–3 pp. This suggests that W4A8 accumulation errors disrupt multi-hop consistency more than single-step visual recognition.
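
The saturation mechanism behind the FP8 KV cache finding can be illustrated with a small, self-contained sketch. It rounds values to the FP8 E4M3 grid in pure Python and shows that in-range values keep a bounded relative error while anything above 448 is clipped. `quantize_e4m3` and the sample values are illustrative assumptions; this is not TRT-LLM code, it ignores the calibrated scaling factors a real engine applies, and the numbers are not measured KV activations.

```python
import math

E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
E4M3_MIN_EXP = -6  # exponent of the smallest normal E4M3 value


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)      # saturate instead of overflowing
    _, e = math.frexp(mag)           # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)   # below 2**-6 the grid is subnormal
    step = 2.0 ** (exp - 3)          # 3 mantissa bits: 8 steps per binade
    return sign * round(mag / step) * step


# Hypothetical KV entries: a few moderate values plus one large outlier.
kv = [0.3, -1.2, 2.5, 900.0]
quantized = [quantize_e4m3(v) for v in kv]
rel_err = [abs(q - v) / abs(v) for q, v in zip(quantized, kv)]

for v, q, err in zip(kv, quantized, rel_err):
    print(f"{v:8.2f} -> {q:8.4f}  relative error {err:.1%}")
```

The in-range entries stay within the 1/16 relative error that three mantissa bits allow, while the outlier loses roughly half its magnitude. An attention layer that reads such clipped keys and values would see distorted scores, which is the kind of error the ablation attributes to the FP8 KV cache.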

Benchmark Table

All results on NVIDIA RTX 5060 Ti (16 GB, Blackwell sm_120). FP8 results use the corrected configuration (--no_kv_fp8). Full per-task MME breakdown and FP8 ablation table in results/reports/final_report.md.

Configuration	VQAv2 (%)	POPE F1 (%)	MME Total	TTFT Speedup	Decode Speedup	Static VRAM (GB)	Dyn VRAM (GB)
PyTorch BF16 (baseline)	82.3	87.4	1952	1.00×	1.00×	4.12	0.17
TRT BF16	82.3	88.0	1972	1.63×	1.47×	6.13	0.81
TRT INT8 (W8A16)	82.0	87.5	1959	1.64×	2.05×	5.06	0.79
TRT INT4 (W4A16)	79.1	87.0	1921	1.68×	2.39×	4.51	0.79
TRT SmoothQuant (W8A8)	82.8	87.4	1973	2.00×	2.13×	5.06	0.79
TRT FP8 (W8A8, BF16 KV)	82.0	87.8	1946	1.86×	2.03×	4.96	1.40
TRT INT4-AWQ (W4A16)	81.3	89.5	1862	1.64×	2.41×	4.52	0.79
TRT NVFP4 (W4A8)	81.6	86.4	1938	2.12×	2.42×	4.19	0.73
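
As a rough cross-check of the bandwidth-bound reading of decode, the Decode Speedup column can be compared with the ideal reduction in weight traffic relative to TRT BF16. The sketch below uses only numbers from the table above. The `ideal` value assumes decode time is proportional to weight bytes read, which ignores KV cache traffic, activations, and kernel overheads, so it should be read as an upper bound and not as a prediction.

```python
# (weight bits, decode speedup vs PyTorch BF16), copied from the table above.
decode_table = {
    "TRT BF16": (16, 1.47),
    "TRT INT8": (8, 2.05),
    "TRT INT4": (4, 2.39),
    "TRT INT4-AWQ": (4, 2.41),
    "TRT NVFP4": (4, 2.42),
}


def gains_vs_trt_bf16(table):
    """Return {name: (measured gain, ideal gain)} relative to TRT BF16."""
    base_bits, base_speedup = table["TRT BF16"]
    rows = {}
    for name, (bits, speedup) in table.items():
        measured = round(speedup / base_speedup, 2)  # observed decode gain
        ideal = base_bits / bits                     # weight-traffic ratio
        rows[name] = (measured, ideal)
    return rows


rows = gains_vs_trt_bf16(decode_table)
for name, (measured, ideal) in rows.items():
    print(f"{name:14s} measured {measured:.2f}x  ideal {ideal:.0f}x")
```

The measured gains grow with each reduction in bitwidth but stay well below the ideal ratios, which suggests that weight reads are a large part of decode cost on this GPU without being the only one.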

Deployment recommendations:

Priority	Tier	Decode Speedup	VQAv2	Notes
Best accuracy–speed balance	TRT SmoothQuant	2.13×	82.8%	Lossless; 2.00× TTFT
Fastest + smallest footprint	TRT NVFP4	2.42×	81.6%	Blackwell only; avoid reasoning-critical tasks
Safe default, no calibration	TRT INT8	2.05×	82.0%	No calibration data needed
FP8 with lossless accuracy	TRT FP8 (BF16 KV)	2.03×	82.0%	Must use `--no_kv_fp8`; higher dyn VRAM (1.4 GB)
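
The recommendations can be sanity-checked by computing which configurations are Pareto-optimal on the two headline metrics. This is a minimal sketch that uses the VQAv2 and Decode Speedup columns from the benchmark table; `pareto_front` is a hypothetical helper, and a full comparison would also weigh POPE, MME, TTFT, and VRAM.

```python
# (VQAv2 %, decode speedup), copied from the benchmark table.
results = {
    "PyTorch BF16": (82.3, 1.00),
    "TRT BF16": (82.3, 1.47),
    "TRT INT8": (82.0, 2.05),
    "TRT INT4": (79.1, 2.39),
    "TRT SmoothQuant": (82.8, 2.13),
    "TRT FP8 (BF16 KV)": (82.0, 2.03),
    "TRT INT4-AWQ": (81.3, 2.41),
    "TRT NVFP4": (81.6, 2.42),
}


def pareto_front(points):
    """Names of configs that no other config matches or beats on both axes."""
    front = []
    for name, (acc, spd) in points.items():
        dominated = any(
            a >= acc and s >= spd and (a > acc or s > spd)
            for other, (a, s) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(results)
print(front)
```

On these two metrics only TRT SmoothQuant and TRT NVFP4 remain undominated, which matches the first two rows of the recommendations table.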

Methods

Model. Qwen2-VL-2B-Instruct (Alibaba, 2024): 2B-parameter VLM with a ViT image encoder and a Qwen2 LLM decoder. The vision encoder is always built in BF16; only the LLM decoder is quantized.

Quantization configs.

Method	W×A	Calibration	Build Pipeline
BF16	W16A16	—	Pipeline A (`convert_checkpoint.py`)
INT8	W8A16	—	Pipeline A
INT4	W4A16	—	Pipeline A
SmoothQuant	W8A8	✓	Pipeline A
FP8	W8A8	✓	Pipeline B (NVIDIA ModelOpt)
INT4-AWQ	W4A16	✓	Pipeline B
NVFP4	W4A8	✓	Pipeline B (Blackwell only)

Efficiency benchmarks. 50 LLaVA-Bench samples. TTFT measured by running max_new_tokens=1 first; decode latency = (total_latency − TTFT) / output_tokens. VRAM measured via nvidia-smi delta (TRT-LLM allocates outside PyTorch's allocator).
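
The latency decomposition and the VRAM delta described above can be sketched as follows. The helper names are hypothetical and are not taken from `metrics.py`; the `nvidia-smi` query flags are standard, but the repository's own measurement code may differ in detail.

```python
import subprocess


def decode_latency_per_token(total_latency_s, ttft_s, output_tokens):
    """Per-token decode latency: (total_latency - TTFT) / output_tokens."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return (total_latency_s - ttft_s) / output_tokens


def parse_used_mib(smi_output):
    """Parse the first GPU's used memory (MiB) from nvidia-smi CSV output."""
    return int(smi_output.strip().splitlines()[0])


def gpu_used_mib():
    """Query used GPU memory in MiB; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)


# Example with made-up timings: 1.5 s total, 0.5 s TTFT, 50 generated tokens.
per_token = decode_latency_per_token(1.5, 0.5, 50)
print(f"{per_token * 1000:.1f} ms/token")
```

A VRAM delta would then be the difference between two `gpu_used_mib()` readings, for example before and after loading an engine. That reading covers the whole device, so other processes on the GPU should be idle during measurement.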

Accuracy benchmarks. VQAv2 (500 validation samples, soft-match against 10 human annotations); POPE (500 × 3 splits, F1); MME (full ~2.8K samples, 14 subtasks: 10 perception + 4 cognition).
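
For readers unfamiliar with the two scoring rules, a minimal sketch is given below. It assumes the common simplified form of the VQA soft-match metric, min(matches / 3, 1), and the usual F1 definition for POPE's yes/no answers. The function names are hypothetical, the official VQA answer normalization is omitted, and `accuracy_metrics.py` in this repository may implement the details differently.

```python
def vqa_soft_score(prediction, human_answers):
    """Simplified VQA accuracy: full credit once 3 annotators agree."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 of 10 annotators answered "cat".
annotations = ["cat"] * 2 + ["dog"] * 8
print(vqa_soft_score("Cat", annotations))   # partial credit
print(vqa_soft_score("dog", annotations))   # full credit
print(f1_score(tp=8, fp=2, fn=2))
```

Under this rule an answer given by only one or two annotators still earns partial credit, which is why VQAv2 scores move in fractions of a point per sample.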

Hardware. NVIDIA RTX 5060 Ti, 16 GB VRAM, Blackwell sm_120, CUDA 12.8.

Setup & Reproduction

Requirements

NVIDIA GPU, CUDA 12.8+
16 GB VRAM recommended, plus at least ~20 GB of system RAM (the Stage 3 vision encoder build peaks at ~20 GB system RAM)
FP8 requires Ada/Hopper/Blackwell; NVFP4 requires Blackwell
Docker with --gpus all support

Step 1 — Download model weights

pip install huggingface_hub
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct \
  --local-dir ./models/Qwen2-VL-2B-Instruct

Step 2 — Build Docker image

docker build -t vlm-bench:latest .

Based on nvcr.io/nvidia/tensorrt-llm/release:0.21.0. Build takes ~10–15 minutes.

Step 3 — Launch container

docker run -it --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v $(pwd):/workspace \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vlm-bench:latest bash

Step 4 — Build TRT engines (inside container)

# Pipeline A: no calibration required
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant bf16
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int8
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int4
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant smoothquant

# Pipeline B: calibration required (NVIDIA ModelOpt)
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant fp8 --no_kv_fp8
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant int4_awq
bash /workspace/scripts/build_trt_engines.sh --model qwen2vl_2b --quant nvfp4

FP8 note: --no_kv_fp8 is required for correct accuracy on Qwen2-VL. Using --kv_cache_dtype fp8 (the default text-LLM recipe) lowers VQAv2 by 8.8 pp because visual token KV values exceed FP8's dynamic range. See results/reports/final_report.md §5.1 for the two-ablation diagnosis.

See scripts/README.md for per-stage build details and the OOM warning for Stage 3.
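
When scripting the builds, it may help to generate the command lines from one place so that the FP8 flag cannot be forgotten. The sketch below only assembles the same commands listed in this step; `build_command` and `all_build_commands` are hypothetical helpers and are not part of the repository.

```python
import shlex

BUILD_SCRIPT = "/workspace/scripts/build_trt_engines.sh"
QUANT_MODES = ["bf16", "int8", "int4", "smoothquant",
               "fp8", "int4_awq", "nvfp4"]


def build_command(quant, model="qwen2vl_2b"):
    """Return the argv list for one engine build."""
    cmd = ["bash", BUILD_SCRIPT, "--model", model, "--quant", quant]
    if quant == "fp8":
        # Keep the KV cache out of FP8, as required for Qwen2-VL accuracy.
        cmd.append("--no_kv_fp8")
    return cmd


def all_build_commands():
    """Commands for every quantization mode, in the order listed above."""
    return [build_command(q) for q in QUANT_MODES]


for cmd in all_build_commands():
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` inside the container to run the builds one after another.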

Step 5 — Run benchmarks

# Efficiency: TTFT, decode latency, VRAM (50 LLaVA-Bench samples)
cd /workspace/benchmark/efficiency && bash run_efficiency_all.sh

# Accuracy: VQAv2 (500), POPE (1500), MME (~2.8K)
cd /workspace/benchmark/accuracy && bash run_accuracy_all.sh

Step 6 — Generate report

cd /workspace/benchmark
python3 report.py --latest
# Output: results/reports/report_<timestamp>/report.md + img/*.png

Repository Layout

.
├── Dockerfile
├── benchmark/
│   ├── efficiency/            # TTFT, decode latency, VRAM benchmark
│   │   ├── config.py
│   │   ├── data_loader.py
│   │   ├── metrics.py
│   │   ├── run_benchmark_baseline.py
│   │   ├── run_benchmark_trt.py
│   │   └── run_efficiency_all.sh
│   ├── accuracy/              # VQAv2 / POPE / MME benchmark
│   │   ├── config_accuracy.py
│   │   ├── data_loader_acc.py
│   │   ├── accuracy_metrics.py
│   │   ├── run_accuracy_baseline.py
│   │   ├── run_accuracy_trt.py
│   │   └── run_accuracy_all.sh
│   └── report.py              # chart + Markdown report generator
├── scripts/
│   ├── build_trt_engines.sh   # TRT engine builder for all quant modes
│   └── README.md
├── models/                    # HF model weights (gitignored — see Step 1)
└── results/
    └── reports/               # tracked in git
        ├── final_report.md    # full written paper
        ├── final_report.pdf
        └── img/               # figures (speed, vram, accuracy, mme, tradeoff, fp8_ablation)

Limitations

Single GPU, single run. Results are on one RTX 5060 Ti; variance estimates are across-sample, not across-run.
Batch size 1 only. Throughput under batched inference would favour compute-bound tiers more strongly.
INT4-AWQ calibration corpus not inspected directly. The translation anomaly is attributed to a multilingual gap but the corpus was not analysed.
Next steps: batch inference profiling; multilingual AWQ calibration; extending the FP8 KV cache ablation to other VLM architectures.

Completed as part of UW Advanced ML coursework (CSE 599S, University of Washington).
