Results mismatch on Mantis and MIRB #1

@Ucicek

Hi, thanks for releasing the code and results.

We are trying to reproduce the reported numbers, but we are seeing discrepancies on both Mantis and MIRB. The results match reasonably well for MUIR-Bench, so we wanted to check whether there is an implementation detail, preprocessing step, or evaluation setting that we may be missing.

Specifically, we get the following results:

Run                          Mantis     MIRB
B0 baseline                  0.6959     0.5975
DTS (scale=10, layers 0–3)   0.6544     0.5645
Δ (DTS − B0)                 −4.15 pp   −3.30 pp

In our reproduction, DTS underperforms the B0 baseline by 4.15 pp on Mantis and 3.30 pp on MIRB. We also tried scaling factors of 8 and 9, which produced the same results.
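
For reference, the deltas in the table can be recomputed from the two accuracy columns. This is only a sanity-check sketch; the dictionary layout and the `delta_pp` helper are names made up for illustration, not part of the repository.

```python
# Accuracies copied from the table above (fractions, not percentages).
results = {
    "Mantis": {"b0": 0.6959, "dts": 0.6544},
    "MIRB": {"b0": 0.5975, "dts": 0.5645},
}


def delta_pp(b0: float, dts: float) -> float:
    """Return DTS minus B0 in percentage points, rounded to 2 decimals."""
    return round((dts - b0) * 100, 2)


# Negative values mean DTS is below the B0 baseline.
deltas = {name: delta_pp(r["b0"], r["dts"]) for name, r in results.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f} pp")
```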

Experimental setup

Model

  • Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Precision: bf16
  • Attention backend: attn_implementation=flash_attention_2
    • flash-attn==2.7.4.post1 installed in the delimscale conda environment
  • Device map: cuda
    • Single-GPU run
  • Batch size: 1
  • max_pixels: 614656
    • Exactly 784 × 784 (614656 = 784²)
    • Used to reduce the image-token count and avoid OOM on MIRB
  • min_pixels: default value
    • 256 × 28 × 28 = 200704
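
To make the resolution settings easier to compare against the authors' configuration, the pixel budgets can be translated into per-image visual-token bounds. The sketch below assumes one visual token per 28 × 28 pixel block, which is inferred from the way `min_pixels` is written above (256 × 28 × 28); treat that constant and the helper name as assumptions rather than confirmed details of the evaluation code.

```python
import math

# Assumption: one visual token corresponds to a 28 x 28 pixel block.
PIXELS_PER_TOKEN = 28 * 28


def visual_token_bounds(min_pixels: int, max_pixels: int) -> tuple[int, int]:
    """Convert pixel budgets into (min, max) visual tokens per image."""
    return min_pixels // PIXELS_PER_TOKEN, max_pixels // PIXELS_PER_TOKEN


min_pixels = 256 * 28 * 28  # default value listed above
max_pixels = 614656         # value passed via model_args

min_tokens, max_tokens = visual_token_bounds(min_pixels, max_pixels)
side = math.isqrt(max_pixels)  # side length of a square image at the cap

print(f"min_pixels={min_pixels}, max_pixels={max_pixels}")
print(f"tokens per image: {min_tokens} to {max_tokens}, square side: {side}")
```

Under this assumption, each image contributes at most 784 visual tokens, so a multi-image sample scales roughly linearly with the number of images.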

Environment variables

export HF_HOME=/path/to/hf_cache
export TRANSFORMERS_CACHE=/path/to/hf_cache
export PYTHONPATH=/path/to/DelimScaling
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

B0 baseline runs

Scripts:

  • scripts/mantis/run_b0.sh

  • scripts/mirb/run_b0.sh

Command template:

CUDA_VISIBLE_DEVICES=$GPU_ID python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=cuda,attn_implementation=flash_attention_2,max_pixels=614656" \
  --tasks {mantis|mirb} \
  --batch_size 1 \
  --delim_scaling False \
  --output_path $OUT_DIR \
  --log_samples

This is a plain forward pass with no interventions.


DTS runs

Scripts:

  • scripts/mantis/run_dts.sh

  • scripts/mirb/run_dts.sh

Command template:

CUDA_VISIBLE_DEVICES=$GPU_ID python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=cuda,attn_implementation=flash_attention_2,max_pixels=614656" \
  --tasks {mantis|mirb} \
  --batch_size 1 \
  --delim_scaling True \
  --scale 10 \
  --select_layer 0,1,2,3 \
  --output_path $OUT_DIR \
  --log_samples
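
Since both runs pass `--log_samples`, the per-sample outputs can be compared to see which samples change between B0 and DTS. The sketch below assumes the logs have already been reduced to a mapping from sample id to a correct/incorrect flag; the loading step is omitted because the log schema is not described here, and `flipped_samples` is a hypothetical helper, not something shipped with the repository.

```python
def flipped_samples(b0: dict, dts: dict) -> dict:
    """Compare per-sample correctness between two runs keyed by sample id."""
    shared = b0.keys() & dts.keys()
    return {
        # Correct under B0 but wrong under DTS.
        "regressed": sorted(k for k in shared if b0[k] and not dts[k]),
        # Wrong under B0 but correct under DTS.
        "improved": sorted(k for k in shared if not b0[k] and dts[k]),
        # Ids present in only one run; these usually indicate a data mismatch.
        "unmatched": sorted(b0.keys() ^ dts.keys()),
    }


# Toy data for illustration only, not taken from our runs.
b0_correct = {0: True, 1: True, 2: False, 3: True}
dts_correct = {0: True, 1: False, 2: True, 3: False}

report = flipped_samples(b0_correct, dts_correct)
print(report)
```

A non-empty `unmatched` list would suggest the two runs did not evaluate the same samples, which would be worth ruling out before comparing accuracies.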

Data

Mantis

  • Dataset: TIGER-Lab/Mantis-Eval

  • Split: test

  • Number of samples: 217

  • Access: gated, HF token required

MIRB

  • Dataset: VLLMs/MIRB-hf

  • Split: test

  • Number of samples: 969

  • Access: gated
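
Given the sample counts above, the accuracy gaps can also be expressed as numbers of samples. This sketch assumes each reported score is a plain fraction of correct answers over the listed sample count; if the task aggregates over subtasks instead, the counts below would not apply. The `implied_correct` helper is illustrative only.

```python
def implied_correct(accuracy: float, n_samples: int) -> int:
    """Nearest whole number of correct samples implied by an accuracy."""
    return round(accuracy * n_samples)


# (number of samples, B0 accuracy, DTS accuracy), copied from this issue.
runs = {
    "Mantis": (217, 0.6959, 0.6544),
    "MIRB": (969, 0.5975, 0.5645),
}

gaps = {}
for name, (n, b0_acc, dts_acc) in runs.items():
    b0_n = implied_correct(b0_acc, n)
    dts_n = implied_correct(dts_acc, n)
    gaps[name] = b0_n - dts_n
    print(f"{name}: B0 {b0_n}/{n}, DTS {dts_n}/{n}, gap {gaps[name]} samples")
```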


Could you clarify whether the reported DTS numbers for these two datasets were produced using the same decoding settings, image resolution / max_pixels, and layer selection? In particular, we wanted to check whether there are any dataset-specific preprocessing steps, evaluation settings, or generation kwargs needed to reproduce the reported values. All our experiments were conducted on A6000 GPUs.
