by Jiajin Tang*, Zhengxuan Wei*, Yuchen Zhu, Cheng Shi, Guanbin Li, Liang Lin, Sibei Yang†
*Equal contribution; †Corresponding Author
git clone https://github.com/SooLab/Sim-DETR.git
cd Sim-DETR
We use video features (CLIP and SlowFast) and text features (CLIP) as inputs. For CLIP, we use the features extracted by R2-Tuning (from the last four layers) but retain only the [CLS] token of each frame for efficiency. You can download our prepared feature files from qvhighlights_features and unzip them into your data root directory.
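Before training, it can help to confirm that the unpacked feature directory exists and is not empty. The following is a minimal sketch, not part of the repository: the function name check_feat_root and the example path are hypothetical, and it only checks that the directory has some content, not that the files are the expected features.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sim-DETR): succeed only if the given
# directory exists and contains at least one entry.
check_feat_root() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  # ls -A lists all entries except . and ..; empty output means an empty directory.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "empty: $dir"
    return 1
  fi
  echo "ok: $dir"
}

# Example call; replace the placeholder path with your own data root:
# check_feat_root /path/to/data_root
```

The function returns a non-zero status on failure, so it can guard a later command, for example `check_feat_root /path/to/data_root && echo "ready"`.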
For Anaconda setup, refer to the official Moment-DETR GitHub repository.
Update feat_root in sim_detr/scripts/train.sh to the path where you saved the features, then run:
bash sim_detr/scripts/train.sh
After training, you can generate hl_val_submission.jsonl and hl_test_submission.jsonl for the validation and test sets by running:
bash sim_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash sim_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'
Replace {direc} with the name of the directory under results/ that holds your saved checkpoint. For more details on submission, see standalone_eval/README.md.
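The two inference commands above differ only in the split name, so they can be wrapped in a small loop. This is a sketch under stated assumptions rather than a script shipped with the repository: the function name run_inference, the RUN variable, and the directory name my_run are all hypothetical. By default it only prints the commands; setting RUN=1 executes them from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Sim-DETR): print, or with RUN=1 execute,
# the inference command for each split.
run_inference() {
  local direc="$1"   # name of the run directory under results/
  local ckpt="results/${direc}/model_best.ckpt"
  local split
  for split in val test; do
    if [ "${RUN:-0}" = "1" ]; then
      # Same invocation as the two commands above.
      bash sim_detr/scripts/inference.sh "$ckpt" "$split"
    else
      # Dry run: show what would be executed.
      echo "bash sim_detr/scripts/inference.sh $ckpt '$split'"
    fi
  done
}

# Dry run with a placeholder directory name:
# run_inference my_run
```

Printing first makes it easy to check the checkpoint path before committing to a full inference pass.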
If you find this repository useful, please cite our work:
@inproceedings{tang2025sim,
title={Sim-DETR: Unlock DETR for Temporal Sentence Grounding},
author={Tang, Jiajin and Wei, Zhengxuan and Zhu, Yuchen and Shi, Cheng and Li, Guanbin and Lin, Liang and Yang, Sibei},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={22760--22771},
year={2025}
}
The annotation files and parts of the implementation are borrowed from Moment-DETR and TR-DETR. Consequently, our code is also released under the MIT License.