Hi, thanks for releasing the code and results.
We are trying to reproduce the reported numbers, but we are seeing discrepancies on both Mantis and MIRB. The results match reasonably well for MUIR-Bench, so we wanted to check whether there is an implementation detail, preprocessing step, or evaluation setting that we may be missing.
Specifically, we get the following results:
| Run | Mantis | MIRB |
|---|---|---|
| baseline | 0.6959 | 0.5975 |
| DTS (scale=10, layers 0–3) | 0.6544 | 0.5645 |
| Δ (DTS − B0) | −4.15 pp | −3.30 pp |
In our reproduction, DTS underperforms the B0 baseline by 4.15 pp on Mantis and 3.30 pp on MIRB. We also tried scaling factors of 8 and 9, which produced the same results.
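For clarity, the Δ values are plain differences of the accuracies above, expressed in percentage points. A minimal sketch of that arithmetic (the dictionary layout and the helper name `delta_pp` are only illustrative):

```python
# Accuracies copied from the results table (fractions in [0, 1]).
results = {
    "Mantis": {"baseline": 0.6959, "dts": 0.6544},
    "MIRB": {"baseline": 0.5975, "dts": 0.5645},
}


def delta_pp(dts: float, baseline: float) -> float:
    """Return DTS minus baseline, in percentage points."""
    return round((dts - baseline) * 100, 2)


deltas = {name: delta_pp(r["dts"], r["baseline"]) for name, r in results.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f} pp")
```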
Experimental setup
Model
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Precision: bf16
- Attention backend: attn_implementation=flash_attention_2 (flash-attn==2.7.4.post1 installed in the delimscale conda environment)
- Device map: cuda
- Batch size: 1
- max_pixels: 614656
  - Approximately 784 × 784
  - Used to reduce the image-token count and avoid OOM on MIRB
- min_pixels: default value (256 × 28 × 28 = 200704)
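To make the resolution setting concrete, here is a rough sketch of what these pixel budgets mean in visual tokens. It assumes one visual token covers a 28 × 28 pixel area (14-pixel patches followed by 2 × 2 merging), which is our understanding of Qwen2.5-VL, and it takes the default min_pixels to be 256 × 28 × 28 = 200704. The actual per-image count depends on aspect ratio and rounding in the processor, so these are only approximate bounds.

```python
import math

# Assumption: one visual token covers a 28 x 28 pixel area
# (14-pixel patches, then 2 x 2 merging), as we understand Qwen2.5-VL.
PIXELS_PER_TOKEN = 28 * 28

max_pixels = 614656         # value passed via --model_args in our runs
min_pixels = 256 * 28 * 28  # assumed default value

side = math.isqrt(max_pixels)                 # side of a square image at the cap
max_tokens = max_pixels // PIXELS_PER_TOKEN   # approximate upper bound per image
min_tokens = min_pixels // PIXELS_PER_TOKEN   # approximate lower bound per image

print(f"max_pixels = {side} x {side}, roughly {max_tokens} visual tokens per image at most")
print(f"min_pixels = {min_pixels}, roughly {min_tokens} visual tokens per image at least")
```

If the reported numbers were produced with a larger max_pixels, the per-image token count would differ from ours, which is one of the things we would like to rule out.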
Environment variables
export HF_HOME=/path/to/hf_cache
export TRANSFORMERS_CACHE=/path/to/hf_cache
export PYTHONPATH=/path/to/DelimScaling
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
B0 baseline runs
Scripts:
scripts/mantis/run_b0.sh
scripts/mirb/run_b0.sh
Command template:
CUDA_VISIBLE_DEVICES=$GPU_ID python -m lmms_eval \
--model qwen2_5_vl \
--model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=cuda,attn_implementation=flash_attention_2,max_pixels=614656" \
--tasks {mantis|mirb} \
--batch_size 1 \
--delim_scaling False \
--output_path $OUT_DIR \
--log_samples
This is a pure vanilla forward pass with no interventions.
DTS runs
Scripts:
scripts/mantis/run_dts.sh
scripts/mirb/run_dts.sh
Command template:
CUDA_VISIBLE_DEVICES=$GPU_ID python -m lmms_eval \
--model qwen2_5_vl \
--model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=cuda,attn_implementation=flash_attention_2,max_pixels=614656" \
--tasks {mantis|mirb} \
--batch_size 1 \
--delim_scaling True \
--scale 10 \
--select_layer 0,1,2,3 \
--output_path $OUT_DIR \
--log_samples
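Since both runs use --log_samples, the per-sample logs can be compared to see which examples change between baseline and DTS. The sketch below is only an illustration: it assumes each log is a JSONL file with a `doc_id` field and a numeric per-sample score field, and the function names `load_scores` and `flipped` are hypothetical. The real field names should be checked against the files actually written to $OUT_DIR.

```python
import json
from pathlib import Path


def load_scores(path, score_key):
    """Map doc_id -> score from a JSONL sample log (field names are assumptions)."""
    scores = {}
    with Path(path).open() as fh:
        for line in fh:
            if line.strip():
                rec = json.loads(line)
                scores[rec["doc_id"]] = float(rec[score_key])
    return scores


def flipped(baseline, dts):
    """Return (regressions, improvements) as sorted lists of doc_ids.

    A regression scored higher under the baseline than under DTS;
    an improvement is the reverse. Only doc_ids present in both runs count.
    """
    shared = baseline.keys() & dts.keys()
    regressions = sorted(d for d in shared if baseline[d] > dts[d])
    improvements = sorted(d for d in shared if baseline[d] < dts[d])
    return regressions, improvements


# Toy data standing in for two loaded logs (not real results).
toy_baseline = {0: 1.0, 1: 1.0, 2: 0.0}
toy_dts = {0: 1.0, 1: 0.0, 2: 1.0}
regressions, improvements = flipped(toy_baseline, toy_dts)
print("regressions:", regressions, "improvements:", improvements)
```

We are happy to share the resulting lists of regressed samples if that would help locate the difference.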
Data
Mantis
Dataset: TIGER-Lab/Mantis-Eval
Split: test
Number of samples: 217
Access: gated, HF token required
MIRB
Dataset: VLLMs/MIRB-hf
Split: test
Number of samples: 969
Access: gated
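As a rough sanity check on effect size, the scores can be converted back to sample counts using the split sizes in this issue (217 for Mantis, 969 for MIRB). This assumes the reported score is plain accuracy over the full test split, which may not hold if a task aggregates per-subtask scores; the helper name `n_correct` is illustrative.

```python
# Split sizes and accuracies as stated in this issue.
split_sizes = {"Mantis": 217, "MIRB": 969}
accuracy = {
    "Mantis": {"baseline": 0.6959, "dts": 0.6544},
    "MIRB": {"baseline": 0.5975, "dts": 0.5645},
}


def n_correct(acc: float, n: int) -> int:
    """Nearest whole number of correct samples, assuming plain accuracy."""
    return round(acc * n)


gap = {}
for name, n in split_sizes.items():
    base = n_correct(accuracy[name]["baseline"], n)
    dts = n_correct(accuracy[name]["dts"], n)
    gap[name] = base - dts
    print(f"{name}: baseline {base}/{n}, DTS {dts}/{n}, gap {gap[name]} samples")
```

Under that assumption, the gap corresponds to 9 samples on Mantis and 32 on MIRB.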
Could you clarify whether the reported DTS numbers for these two datasets were produced using the same decoding settings, image resolution / max_pixels, and layer selection? In particular, we wanted to check whether there are any dataset-specific preprocessing steps, evaluation settings, or generation kwargs needed to reproduce the reported values. All our experiments were conducted on A6000 GPUs.