
CLSE on PriorTR — Cross-Layer Spectral Evolution token pruning

Training-free, progressive visual-token pruning, integrated into PriorTR as a drop-in strategy (strategy=clse; vtr_strategy=clse on the Qwen wrappers and --vtr_strategy clse on Video-LLaVA). Each token is scored by how its spectral (frequency-domain) content evolves across decoder layers, multiplied by its text→image attention. Tokens are then pruned progressively: three stages on the image backbones, a single stage on Video-LLaVA.

🧩 Part of PriorTR · unified runner · reference: CLSE

| Backbone | wrapper (`--model`) | budget knob | env · transformers |
|---|---|---|---|
| LLaVA-1.5 | `llava_vtr` | `keep_tokens` = 192 / 128 / 64 | README · 4.37 |
| Qwen2-VL-7B | `qwen2_vl_vtr` | `vtr_retain_ratio` = 0.334 / 0.223 / 0.112 | README · 4.57 |
| Qwen3-VL-8B | `qwen3_vl_vtr` | `vtr_retain_ratio` = 0.334 / 0.223 / 0.112 | README · 5.2 |
| Video-LLaVA-7B | native script | `--vtr_keep_tokens` (of 2048) | README · 4.37 |

⚙️ Setup

CLSE needs no extra packages. Build each backbone's standard PriorTR env and register its lmms-eval wrapper as described in its README (LLaVA · Qwen2-VL · Qwen3-VL); strategy=clse is then available immediately. Qwen2-VL uses stock transformers with no patch; run its wrapper with PYTHONPATH=<repo>/image/Qwen2-VL. Video-LLaVA runs through its own inference script (not lmms-eval) in the PriorTRvideollava env, where --vtr_strategy clse is built in.

🚀 Running

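Just set the strategy to clse (strategy=clse on LLaVA, vtr_strategy=clse on Qwen, --vtr_strategy clse on Video-LLaVA) and pass one budget knob. The 3 prune stages and the spectral reference layer auto-resolve from the model's decoder depth, so you don't pass them.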

# LLaVA-1.5      keep_tokens = 192 | 128 | 64
python -m lmms_eval --model llava_vtr \
  --model_args "pretrained=liuhaotian/llava-v1.5-7b,strategy=clse,keep_tokens=192" \
  --tasks gqa,mme,pope,textvqa_val --batch_size 1 --output_path ./results/llava_clse

# Qwen2-VL       vtr_retain_ratio = 0.334 | 0.223 | 0.112
PYTHONPATH=<repo>/image/Qwen2-VL python -m lmms_eval --model qwen2_vl_vtr \
  --model_args "pretrained=Qwen/Qwen2-VL-7B-Instruct,vtr_strategy=clse,vtr_retain_ratio=0.334" \
  --tasks mme,gqa --batch_size 1 --output_path ./results/qwen2_clse

# Qwen3-VL       same presets
python -m lmms_eval --model qwen3_vl_vtr \
  --model_args "pretrained=Qwen/Qwen3-VL-8B-Instruct,attn_implementation=sdpa,vtr_strategy=clse,vtr_retain_ratio=0.334" \
  --tasks mme,gqa --batch_size 1 --output_path ./results/qwen3_clse

# Video-LLaVA    native script (single budget knob: tokens of 2048); MSVD/MSRVTT/TGIF/ActivityNet
python videollava/eval/video/run_inference_video_qa.py --model_path LanguageBind/Video-LLaVA-7B \
  --cache_dir ./cache --video_dir $DATA/videos \
  --gt_file_question $DATA/test_q.json --gt_file_answers $DATA/test_a.json \
  --output_dir output/msvd_clse --output_name pred --num_samples 500 \
  --vtr_enabled --vtr_strategy clse --vtr_keep_tokens 64

Baseline (no pruning): enabled=False (LLaVA) / vtr_enabled=False (Qwen / Video-LLaVA). Override the auto-resolved layers only for a different model size; an explicit value always wins. The syntax differs per backbone: prune_layer=[1,11,21],ref_layers=[0] on LLaVA (bracketed lists), vtr_prune_layer=1;10;19 on Qwen (semicolon-separated), and --vtr_prune_layer 3 --vtr_ref_layers 2 on Video-LLaVA.
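The two list syntaxes for layer overrides are easy to mix up, so here is a minimal sketch of a parser that accepts both. `parse_layer_override` is a hypothetical helper written for illustration; it is not taken from the PriorTR code, and the real wrappers may parse these values differently.

```python
import json


def parse_layer_override(value: str) -> list[int]:
    """Parse a layer override in bracketed or semicolon-separated form.

    Hypothetical helper: illustrates the two syntaxes only.
    """
    text = value.strip()
    if text.startswith("["):
        # Bracketed list, e.g. "[1,11,21]" (the LLaVA-style syntax).
        layers = json.loads(text)
    else:
        # Semicolon-separated, e.g. "1;10;19" (the Qwen-style syntax).
        # A bare "3" is treated as a single-element list.
        layers = [int(part) for part in text.split(";") if part.strip()]
    if not layers or any(type(i) is not int or i < 0 for i in layers):
        raise ValueError(f"expected non-negative integer layer indices, got {value!r}")
    return layers


print(parse_layer_override("[1,11,21]"))
print(parse_layer_override("1;10;19"))
```

Both calls return a plain list of integers, so downstream code does not need to know which syntax was used.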

🎛️ Knobs

keep_tokens and vtr_retain_ratio are interchangeable on every backbone; each selects one hard-coded 3-stage keep schedule (the original CLSE budgets, floored per stage):

| budget | `keep_tokens` | `vtr_retain_ratio` | LLaVA stages / 576 | Qwen stages (of orig. len) |
|---|---|---|---|---|
| high | 192 | 0.334 | 330 / 210 / 62 | 0.57 / 0.36 / 0.098 |
| mid | 128 | 0.223 | 220 / 140 / 41 | 0.38 / 0.24 / 0.066 |
| low | 64 | 0.112 | 110 / 70 / 20 | 0.19 / 0.12 / 0.034 |
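To make the table concrete, the sketch below turns a budget preset into per-stage keep counts. The preset values are copied from the table; the names `PRESETS` and `stage_keeps` are hypothetical, and the use of `Fraction` is an assumption made here so that the per-stage floor is exact (the actual implementation may simply floor a float product).

```python
from fractions import Fraction
from math import floor

# Values copied from the budget table; the dictionary layout is illustrative.
PRESETS = {
    "high": {"llava": (330, 210, 62), "qwen": ("0.57", "0.36", "0.098")},
    "mid": {"llava": (220, 140, 41), "qwen": ("0.38", "0.24", "0.066")},
    "low": {"llava": (110, 70, 20), "qwen": ("0.19", "0.12", "0.034")},
}


def stage_keeps(budget: str, backbone: str, num_visual_tokens: int = 576) -> list[int]:
    """Return the number of visual tokens kept after each of the 3 stages."""
    preset = PRESETS[budget][backbone]
    if backbone == "llava":
        # LLaVA counts are absolute (out of 576 visual tokens).
        return list(preset)
    # Qwen ratios are relative to the original visual length, floored per stage.
    return [floor(Fraction(r) * num_visual_tokens) for r in preset]


print(stage_keeps("high", "llava"))
print(stage_keeps("high", "qwen", num_visual_tokens=1296))
```

The `1296` in the second call is an arbitrary example length, since Qwen's visual token count varies with image resolution.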

Prune layers (auto, from decoder depth): LLaVA-32 [1,11,21] · Qwen2-VL-28 [1,10,19] · Qwen3-VL-36 [1,13,24]. The first two are the original CLSE schedules; Qwen3-VL (no upstream CLSE) is our depth-aligned port. Override with prune_layer / vtr_prune_layer for other sizes.
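The resolution rule can be pictured as a lookup keyed on decoder depth, with an explicit override taking precedence. This is only a sketch: `AUTO_PRUNE_LAYERS` and `resolve_prune_layers` are hypothetical names, the table holds just the three depths listed above, and the real code may derive the layers from a formula rather than a lookup.

```python
from typing import Optional, Sequence

# Depth -> prune layers, as listed above (32: LLaVA, 28: Qwen2-VL, 36: Qwen3-VL).
AUTO_PRUNE_LAYERS = {
    32: [1, 11, 21],
    28: [1, 10, 19],
    36: [1, 13, 24],
}


def resolve_prune_layers(
    num_decoder_layers: int, override: Optional[Sequence[int]] = None
) -> list[int]:
    """Pick prune layers: an explicit override always wins over the auto value."""
    if override is not None:
        return list(override)
    if num_decoder_layers not in AUTO_PRUNE_LAYERS:
        # Unknown depth: in this sketch the caller must supply an override.
        raise ValueError(
            f"no auto schedule for depth {num_decoder_layers}; pass an override"
        )
    return list(AUTO_PRUNE_LAYERS[num_decoder_layers])


print(resolve_prune_layers(32))
print(resolve_prune_layers(40, override=[1, 14, 27]))
```

The override in the second call is a made-up example for an unlisted depth, not a recommended schedule.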

Video-LLaVA is single-stage (the original CLSE video schedule): it snapshots the spectral reference at ref_layers=[2], then prunes once at layer [3], keeping --vtr_keep_tokens of the 2048-token 8×16×16 grid. The spectral term uses a per-frame 2D FFT, with the temporal axis folded into the batch. Sparse 8-frame video aliases under a 3D FFT, so the temporal signal is carried by cross-layer evolution instead; set clse_fft_3d=True to try the 3D variant. Both the prune layer and the reference layer auto-resolve from --vtr_strategy clse alone.
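"Folding the temporal axis into the batch" means each of the 8 frames is treated as an independent 16×16 grid. The sketch below shows that regrouping on flat token indices. It assumes frame-major, row-major token order, which this document does not state, and `token_to_grid` / `fold_frames` are hypothetical helpers, not PriorTR functions.

```python
FRAMES, HEIGHT, WIDTH = 8, 16, 16  # 8 x 16 x 16 = 2048 visual tokens


def token_to_grid(index: int) -> tuple[int, int, int]:
    """Map a flat token index to (frame, row, col), assuming frame-major order."""
    frame, within_frame = divmod(index, HEIGHT * WIDTH)
    row, col = divmod(within_frame, WIDTH)
    return frame, row, col


def fold_frames(tokens: list) -> list:
    """Split 2048 flat tokens into 8 independent 16x16 grids (one per frame)."""
    if len(tokens) != FRAMES * HEIGHT * WIDTH:
        raise ValueError(f"expected {FRAMES * HEIGHT * WIDTH} tokens, got {len(tokens)}")
    per_frame = HEIGHT * WIDTH
    return [
        [
            tokens[f * per_frame + r * WIDTH : f * per_frame + (r + 1) * WIDTH]
            for r in range(HEIGHT)
        ]
        for f in range(FRAMES)
    ]


grids = fold_frames(list(range(2048)))
print(len(grids), len(grids[0]), len(grids[0][0]))
print(token_to_grid(2047))
```

A per-frame 2D transform would then run over each `grids[f]` separately, whereas a 3D variant would also transform along the 8-frame axis.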

The spectral hyper-params clse_cutoff_ratio and clse_temp (both 0.1) are config-tunable (vtr_clse_* on Qwen, --vtr_clse_* on Video-LLaVA). The other design choices (score = evolution × attention, spectral term used only at the first stage) are fixed in strategy/clse.py.
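As a rough illustration of the fixed score = evolution × attention rule, the sketch below multiplies two per-token score lists and keeps the top-k tokens. It deliberately omits clse_cutoff_ratio and clse_temp, whose exact roles are not described here. `select_tokens` is a hypothetical function, and returning the kept indices in their original order is an assumption.

```python
def select_tokens(evolution: list[float], attention: list[float], keep: int) -> list[int]:
    """Return indices of the `keep` highest-scoring tokens, in original order.

    Score is the elementwise product evolution * attention.
    """
    if len(evolution) != len(attention):
        raise ValueError("evolution and attention must have the same length")
    if not 0 <= keep <= len(evolution):
        raise ValueError("keep must be between 0 and the number of tokens")
    scores = [e * a for e, a in zip(evolution, attention)]
    # Rank token indices by descending score; sorted() is stable, so ties
    # keep their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors so the kept tokens stay in sequence order.
    return sorted(ranked[:keep])


evolution = [0.9, 0.1, 0.5, 0.7]
attention = [0.2, 0.9, 0.8, 0.1]
print(select_tokens(evolution, attention, keep=2))
```

Token 1 has the highest attention and token 0 the highest evolution, but the product favours tokens that score reasonably on both, which is the point of combining the two terms.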
