Results mismatch on Mantis and MIRB

Hi, thanks for releasing the code and results.

We are trying to reproduce the reported numbers, but we are seeing discrepancies on both Mantis and MIRB. The results match reasonably well for MUIR-Bench, so we wanted to check whether there is an implementation detail, preprocessing step, or evaluation setting that we may be missing.

Specifically, we get the following results:

| Run | Mantis | MIRB |
|---|---:|---:|
| B0 baseline | 0.6959 | 0.5975 |
| DTS (`scale=10`, layers `0–3`) | 0.6544 | 0.5645 |
| Δ (DTS − B0) | −4.15 pp | −3.30 pp |

In our reproduction, DTS underperforms the B0 baseline by **4.15 pp** on Mantis and **3.30 pp** on MIRB. We also tried scaling factors of 8 and 9, which produced the same results.
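
For reference, the gaps in the table can be recomputed with the short sketch below. The accuracies come from the table and the sample counts from the Data section of this issue. The conversion to whole-sample counts assumes the reported metric is plain accuracy over all test samples, which we have not confirmed against the evaluation code; the names `RESULTS`, `delta_pp`, and `approx_sample_gap` are ours.

```python
# Accuracies from the results table; n is the number of test samples per dataset.
RESULTS = {
    "Mantis": {"b0": 0.6959, "dts": 0.6544, "n": 217},
    "MIRB": {"b0": 0.5975, "dts": 0.5645, "n": 969},
}


def delta_pp(b0: float, dts: float) -> float:
    """Return DTS minus B0 in percentage points, rounded to two decimals."""
    return round((dts - b0) * 100, 2)


def approx_sample_gap(b0: float, dts: float, n: int) -> int:
    """Approximate number of samples behind the gap.

    Assumption: the metric is plain accuracy over n samples.
    """
    return round(b0 * n) - round(dts * n)


for name, r in RESULTS.items():
    gap_pp = delta_pp(r["b0"], r["dts"])
    gap_n = approx_sample_gap(r["b0"], r["dts"], r["n"])
    print(f"{name}: {gap_pp:+.2f} pp, about {gap_n} samples")
```

Under that assumption, the gaps correspond to roughly 9 samples on Mantis and 32 samples on MIRB.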

## Experimental setup

### Model

- **Model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Precision:** `bf16`
- **Attention backend:** `attn_implementation=flash_attention_2`
  - `flash-attn==2.7.4.post1` installed in the `delimscale` conda environment
- **Device map:** `cuda`
  - Single-GPU run
- **Batch size:** `1`
- **max_pixels:** `614656`
  - Approximately `784 × 784`
  - Used to reduce the image-token count and avoid OOM on MIRB
- **min_pixels:** default value
  - `256 × 28 × 28 = 200704`
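
The pixel limits above can be translated into a rough per-image visual-token budget. This is a minimal sketch, assuming the usual Qwen2.5-VL convention that one visual token covers a 28 × 28 pixel block (14-pixel patches merged 2 × 2); the constants and the `token_budget` helper are ours, not part of the repository.

```python
import math

PATCH_SIZE = 14  # assumed ViT patch size in pixels
MERGE_SIZE = 2   # assumed 2 x 2 spatial merge of patches
PIXELS_PER_TOKEN = (PATCH_SIZE * MERGE_SIZE) ** 2  # 28 * 28 = 784

MAX_PIXELS = 614656
MIN_PIXELS = 256 * 28 * 28


def token_budget(pixels: int) -> int:
    """Number of visual tokens that fit into a given pixel budget."""
    return pixels // PIXELS_PER_TOKEN


# Side length of a square image that exactly fills max_pixels.
print("max side:", math.isqrt(MAX_PIXELS))
print("max tokens per image:", token_budget(MAX_PIXELS))
print("min tokens per image:", token_budget(MIN_PIXELS))
```

With these settings each image contributes between 256 and 784 visual tokens under the stated assumption, so multi-image samples scale the total accordingly.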

## Environment variables

```bash
export HF_HOME=/path/to/hf_cache
export TRANSFORMERS_CACHE=/path/to/hf_cache
export PYTHONPATH=/path/to/DelimScaling
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

---

## B0 baseline runs

Scripts:

- `scripts/mantis/run_b0.sh`
- `scripts/mirb/run_b0.sh`

Command template:

```bash
CUDA_VISIBLE_DEVICES=$GPU_ID python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=cuda,attn_implementation=flash_attention_2,max_pixels=614656" \
  --tasks {mantis|mirb} \
  --batch_size 1 \
  --delim_scaling False \
  --output_path $OUT_DIR \
  --log_samples
```

This is a pure vanilla forward pass with no interventions.

---

## DTS runs

Scripts:

- `scripts/mantis/run_dts.sh`
- `scripts/mirb/run_dts.sh`

Command template:

```bash
CUDA_VISIBLE_DEVICES=$GPU_ID python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=cuda,attn_implementation=flash_attention_2,max_pixels=614656" \
  --tasks {mantis|mirb} \
  --batch_size 1 \
  --delim_scaling True \
  --scale 10 \
  --select_layer 0,1,2,3 \
  --output_path $OUT_DIR \
  --log_samples
```

---

## Data

### Mantis

- Dataset: `TIGER-Lab/Mantis-Eval`
- Split: `test`
- Number of samples: `217`
- Access: gated, HF token required

### MIRB

- Dataset: `VLLMs/MIRB-hf`
- Split: `test`
- Number of samples: `969`
- Access: gated

---

Could you clarify whether the reported DTS numbers for these two datasets were produced using the same decoding settings, image resolution / `max_pixels`, and layer selection? In particular, we wanted to check whether there are any dataset-specific preprocessing steps, evaluation settings, or generation kwargs needed to reproduce the reported values. All our experiments were conducted on A6000 GPUs.
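
If it helps with debugging, the B0 and DTS runs can also be compared sample by sample rather than only through aggregate accuracy. The sketch below is illustrative only: it assumes per-sample correctness has already been extracted from the `--log_samples` output into dictionaries keyed by sample id, and it makes no assumption about the log schema itself. The `flip_counts` helper and the toy data are hypothetical.

```python
from collections import Counter


def flip_counts(b0: dict, dts: dict) -> Counter:
    """Classify each shared sample by how its correctness changed from B0 to DTS."""
    counts = Counter()
    for sample_id in b0.keys() & dts.keys():
        before, after = bool(b0[sample_id]), bool(dts[sample_id])
        if before and not after:
            counts["regressed"] += 1      # correct under B0, wrong under DTS
        elif after and not before:
            counts["fixed"] += 1          # wrong under B0, correct under DTS
        elif before:
            counts["both_correct"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


# Toy data standing in for per-sample correctness (not real results).
b0_correct = {0: True, 1: True, 2: False, 3: True}
dts_correct = {0: True, 1: False, 2: True, 3: False}
print(flip_counts(b0_correct, dts_correct))
```

A large number of regressed samples with few fixed ones would point to a systematic difference in setup, whereas roughly balanced flips would suggest sensitivity to settings such as `max_pixels`.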
