CVPR 2026 · 📄 Paper
Narges Norouzi¹, Idil Esen Zulfikar²,*, Niccolò Cavagnero¹,*, Tommie Kerssies¹, Bastian Leibe², Gijs Dubbelman¹, Daan de Geus¹
¹ Eindhoven University of Technology, ² RWTH Aachen University, * Equal contribution
📢 Announcement: We released PMT (Plain Mask Transformer), the next generation of VidEoMT — a segmentation model that works on top of frozen Vision Foundation Model features, requiring no encoder finetuning. The encoder stays fully frozen and shareable across tasks, while matching the accuracy and speed of finetuned alternatives.
Both the research paper and the full source code are publicly available: 📄 Paper · 💻 Code
We introduce Video Encoder-only Mask Transformer (VidEoMT), a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It performs both spatial and temporal reasoning within the ViT encoder, without relying on dedicated tracking modules or heavy task-specific heads.
VidEoMT propagates information over time by reusing queries from the previous frame and fusing them with a compact set of learned, frame-agnostic queries. This design achieves competitive accuracy while being 5×–10× faster than existing approaches, reaching up to 160 FPS with a ViT-L backbone.
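The propagation scheme described above can be illustrated with a toy sketch. It assumes that fusion is a plain element-wise sum and uses a placeholder function in place of the ViT encoder; neither is taken from the VidEoMT code, and the names `fuse_queries`, `toy_encoder`, and `run_online` are hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]
Queries = List[Vector]  # one vector per query


def fuse_queries(prev: Optional[Queries], learned: Queries) -> Queries:
    """Combine the previous frame's queries with the learned, frame-agnostic ones.

    ASSUMPTION: element-wise addition; the real fusion may differ.
    """
    if prev is None:  # first frame: nothing to propagate yet
        return [list(q) for q in learned]
    return [[p + l for p, l in zip(pq, lq)] for pq, lq in zip(prev, learned)]


def toy_encoder(frame: Vector, queries: Queries) -> Queries:
    """Placeholder for the ViT encoder: shifts every query by the frame mean."""
    mean = sum(frame) / len(frame)
    return [[v + mean for v in q] for q in queries]


def run_online(
    frames: List[Vector],
    learned: Queries,
    encoder: Callable[[Vector, Queries], Queries] = toy_encoder,
) -> List[Queries]:
    """Process frames one at a time, carrying the output queries forward."""
    prev: Optional[Queries] = None
    outputs: List[Queries] = []
    for frame in frames:
        queries = fuse_queries(prev, learned)
        prev = encoder(frame, queries)  # becomes the next frame's input
        outputs.append(prev)
    return outputs
```

The point of the sketch is the data flow: no separate tracker is involved, and the only state carried between frames is the set of output queries.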
If you don't have Conda installed, install Miniconda and restart your shell:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Then create the environment, activate it, and install the dependencies:
conda create -n videomt python==3.12.3
conda activate videomt
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
python -m pip install --no-build-isolation 'git+https://github.com/facebookresearch/detectron2.git'
pip install git+https://github.com/cocodataset/panopticapi.git
python3 -m pip install -r requirements.txt
Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:
wandb login
Download and prepare the datasets.
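Before moving on, it can help to confirm that the packages installed above are visible to the active interpreter. The following is a minimal sketch using only the standard library. The import names listed are the usual ones for these packages, and the helper names `missing_modules` and `report` are hypothetical.

```python
import importlib.util
import sys

# Import names for the packages installed above; wandb is optional.
REQUIRED = ["torch", "torchvision", "detectron2", "panopticapi"]
OPTIONAL = ["wandb"]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def report():
    """Print the interpreter version and any packages that were not found."""
    print(f"Python {sys.version.split()[0]}")
    for name in missing_modules(REQUIRED):
        print(f"missing required package: {name}")
    for name in missing_modules(OPTIONAL):
        print(f"missing optional package: {name}")


if __name__ == "__main__":
    report()
```

This only checks that the modules can be found. It does not verify versions or that CUDA is usable.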
To evaluate a pre-trained VidEoMT model, first prepare the datasets by following the instructions in this link and download the trained weights from here. Once these are set up, run:
python train_net_video.py \
--num-gpus 1 \
--config-file /path/to/config.yaml \
--eval-only MODEL.WEIGHTS /path/to/weight.pth \
MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
OUTPUT_DIR /path/to/output
🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
🔧 Replace /path/to/output with the path to the output folder.
🔧 Change the value of --num-gpus to the number of GPUs available to you.
For detailed instructions on running evaluation on different datasets, see Evaluation.
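When evaluating several checkpoints, it may be convenient to assemble the evaluation command programmatically. The sketch below mirrors the flags and overrides shown above and adds nothing new. The helper name `build_eval_command` is hypothetical, and the paths are the same placeholders used in this README.

```python
import shlex
from typing import List


def build_eval_command(
    config: str, weights: str, output_dir: str, num_gpus: int = 1
) -> List[str]:
    """Assemble the argv for an evaluation run, mirroring the command above."""
    return [
        "python", "train_net_video.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config,
        "--eval-only",
        # The remaining arguments are KEY VALUE config overrides.
        "MODEL.WEIGHTS", weights,
        "MODEL.BACKBONE.TEST.WINDOW_SIZE", "1",
        "OUTPUT_DIR", output_dir,
    ]


cmd = build_eval_command(
    "/path/to/config.yaml", "/path/to/weight.pth", "/path/to/output"
)
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems if a path contains spaces.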
To train an online VidEoMT model, run:
python3 train_net_video.py \
--num-gpus 4 \
--num-machines 2 \
--config-file /path/to/config.yaml \
MODEL.WEIGHTS /path/to/segmenter_weight.pth \
MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
OUTPUT_DIR /path/to/output
Replace /path/to/segmenter_weight.pth with the segmenter checkpoint used to initialize training. For DINOv2 models, choose this weight from the Init Weights column in DINOv2 Models.
Replace /path/to/output with the directory where training logs and checkpoints should be written.
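The training command above passes `--num-gpus 4` together with `--num-machines 2`. In Detectron2's launcher, `--num-gpus` counts GPUs per machine, so this setting corresponds to 8 processes in total, with the command started once on each machine. The sketch below spells this out. It assumes that `train_net_video.py` uses Detectron2's default argument parser, which provides `--machine-rank` and `--dist-url`. The helper names and the `tcp://HOST:PORT` placeholder are hypothetical.

```python
from typing import List


def world_size(num_gpus_per_machine: int, num_machines: int) -> int:
    """Total number of training processes across all machines."""
    return num_gpus_per_machine * num_machines


def per_machine_flags(
    num_gpus: int, num_machines: int, dist_url: str
) -> List[List[str]]:
    """Return one launcher flag list per machine; only the rank differs.

    ASSUMPTION: the script accepts Detectron2's --machine-rank and --dist-url.
    """
    return [
        [
            "--num-gpus", str(num_gpus),
            "--num-machines", str(num_machines),
            "--machine-rank", str(rank),
            "--dist-url", dist_url,
        ]
        for rank in range(num_machines)
    ]


print(world_size(4, 2))
for flags in per_machine_flags(4, 2, "tcp://HOST:PORT"):
    print(" ".join(flags))
```

On a single machine, `--num-machines` can be left at 1 and only `--num-gpus` needs to match the available hardware.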
To calculate the FPS and GFLOPs, run:
# DINOv2 FPS
python benchmark.py \
--task fps \
--config-file /path/to/config.yaml \
--model-weights /path/to/weight.pth \
--warmup-iters 100 \
--model-type dinov2
# DINOv3 FPS
python benchmark.py \
--task fps \
--config-file /path/to/config.yaml \
--model-weights /path/to/weight.pth \
--warmup-iters 100 \
--model-type dinov3 \
--fused-qkv
# DINOv2 GFLOPs
export TIMM_FUSED_ATTN=0
python benchmark.py \
--task flops \
--config-file /path/to/config.yaml \
--model-weights /path/to/weight.pth \
--model-type dinov2
For DINOv3 FPS benchmarking, enable --fused-qkv. This is recommended to get FPS closer to the DINOv2 setup.
🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
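To make the role of `--warmup-iters` concrete, here is a generic sketch of a warm-up-then-time throughput measurement. It is not the code in `benchmark.py`. The function `measure_fps` and the stand-in workload `dummy_step` are hypothetical, and a real GPU benchmark would also need to synchronize the device before reading the clock.

```python
import time
from typing import Callable


def measure_fps(
    step: Callable[[], None], warmup_iters: int = 100, timed_iters: int = 200
) -> float:
    """Return iterations per second for `step`, excluding warm-up calls."""
    for _ in range(warmup_iters):
        step()  # lets caches and lazy initialization settle before timing
    start = time.perf_counter()
    for _ in range(timed_iters):
        step()
    # On a GPU, pending kernels must be synchronized here
    # (e.g. torch.cuda.synchronize()) before the clock is read.
    elapsed = time.perf_counter() - start
    return timed_iters / elapsed


def dummy_step() -> None:
    """Stand-in for one forward pass on one frame."""
    sum(i * i for i in range(1000))


fps = measure_fps(dummy_step, warmup_iters=10, timed_iters=50)
print(f"{fps:.1f} iterations/s, {1000.0 / fps:.3f} ms per iteration")
```

The same conversion applies to the reported numbers: 160 FPS corresponds to 1000 / 160 = 6.25 ms per frame.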
We provide example visualizations below.
To generate additional visualization samples, please use the code in Visualization.
- [x] Inference code
- [x] FLOPs and FPS code
- [x] Visualization code
- [x] Training code
- [x] DINOv3 model zoo and code
We provide pre-trained weights for both DINOv2- and DINOv3-based VidEoMT models.
- DINOv2 Models - Original published results and pre-trained weights.
- DINOv3 Models - DINOv3-based models and pre-trained weights.
If you find this work useful in your research, please cite it using the BibTeX entry below:
@inproceedings{norouzi2026videomt,
author = {Norouzi, Narges and Zulfikar, Idil and Cavagnero, Niccol\`{o} and Kerssies, Tommie and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},
title = {{VidEoMT: Your ViT is Secretly Also a Video Segmentation Model}},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}
This project builds upon code from the following libraries and repositories:
- EoMT (MIT License)
- Hugging Face Transformers (Apache-2.0 License)
- PyTorch Image Models (timm) (Apache-2.0 License)
- CAVIS (MIT License)
- Mask2Former (Apache-2.0 License)
- Detectron2 (Apache-2.0 License)




