Free2Frame is a training-free video understanding framework for video question answering. It selects and organizes informative frames to support multimodal large language models, enabling efficient inference and evaluation across multiple video QA benchmarks without additional model training.
conda create -n free2frame python=3.10 -y
conda activate free2frame
pip install -r requirements.txt
Run commands from the project root. By default, scripts look for models, data, and outputs under these paths:
Free2Frame/
  checkpoints/
    llava-v1.6-7b/
    clip-vit-base-patch32/
  data/
    gt_qa_files/
      MSRVTT_Zero_Shot_QA/{val_q.json,val_a.json}
      MSVD_Zero_Shot_QA/{val_q.json,val_a.json}
      TGIF_Zero_Shot_QA/{test_q.json,test_a.json}
      Activitynet_Zero_Shot_QA/{test_q.json,test_a.json}
      EgoSchema/val_qa.json
      IntentQA/val_qa.json
      NExTQA/val_qa.json
      MLVU/test_multi_choice_tasks.json
      VCGBench/
    videos/
      MSRVTT-QA/videos/
      MSVD-QA/videos/
      TGIF_Zero_Shot_QA/all_test/
      Activitynet_Zero_Shot_QA/all_test/
      egoschema/videos/
      intentqa/videos/
      nextqa/NExTVideo/
      MLVU_Test/video/
    MVBench/
      json/counterfactual_inference.json
      video/clevrer/video_validation/
  outputs/
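Before a long run, it can help to confirm that the default locations exist. The following is a minimal sketch, not part of the repository: `check_layout` is a hypothetical helper, it assumes `gt_qa_files/` and `videos/` sit under `data/`, and it assumes the override variables described just below fall back to the default layout when unset. How the scripts actually resolve these paths may differ.

```shell
# Sketch: resolve each location from its environment variable, falling back
# to the default layout, then report anything that is missing.
# Assumption: gt_qa_files/ and videos/ live under data/.
check_layout() {
    root="${1:-.}"
    data_dir="${DATA_DIR:-$root/data}"
    missing=0
    for dir in \
        "${MODEL_PATH:-$root/checkpoints/llava-v1.6-7b}" \
        "${CLIP_MODEL_PATH:-$root/checkpoints/clip-vit-base-patch32}" \
        "${GT_QA_DIR:-$data_dir/gt_qa_files}" \
        "${VIDEO_DIR:-$data_dir/videos}"
    do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every directory was found.
    [ "$missing" -eq 0 ]
}

if check_layout .; then
    echo "layout OK"
else
    echo "create the paths above or set the matching environment variable"
fi
```

Run from the project root, this prints one `missing:` line per absent directory, so a misplaced checkpoint or dataset shows up before inference starts.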
You can override paths without editing scripts:
export MODEL_PATH=/path/to/llava-v1.6-7b
export CLIP_MODEL_PATH=/path/to/clip-vit-base-patch32
export DATA_DIR=/path/to/data
export GT_QA_DIR=/path/to/gt_qa_files
export VIDEO_DIR=/path/to/videos
export OUTPUT_ROOT=/path/to/outputs
Dataset-specific overrides are also available, for example:
export MSRVTT_VIDEO_DIR=/path/to/MSRVTT-QA/videos
export MSVD_VIDEO_DIR=/path/to/MSVD-QA/videos
export MVBENCH_VIDEO_DIR=/path/to/MVBench/video/clevrer/video_validation
export MVBENCH_QA_FILE=/path/to/MVBench/json/counterfactual_inference.json
Dataset converters are available in scripts/data:
python scripts/data/build_msrvtt_qa.py --qa_file /path/to/MSRVTT_QA.csv
python scripts/data/build_msvd_qa.py --qa_file /path/to/MSVD_QA.csv
python scripts/data/build_tgif_qa.py --qa_file /path/to/TGIF_FrameQA.csv
python scripts/data/build_activitynet_qa.py --qa_file /path/to/Activitynet_QA.csv
python scripts/data/build_egoschema_qa.py --qa_file /path/to/EgoSchema.csv
python scripts/data/build_intentqa_qa.py --qa_file /path/to/IntentQA.csv
python scripts/data/build_nextqa_qa.py --qa_file /path/to/NExT_QA.csv
python scripts/data/build_vcgbench_qa.py --qa_folder /path/to/text_generation_benchmark
By default, converted files are written to data/gt_qa_files. Use --output_root to write them elsewhere.
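When several datasets need converting into the same non-default directory, the commands above can be batched. The sketch below is illustrative only: `RAW_QA_DIR`, `OUT_DIR`, and `DRY_RUN` are hypothetical names introduced here, and the raw file names are the ones used in the commands above. It defaults to a dry run that only prints each command.

```shell
# Sketch: run several converters with one shared output directory.
# RAW_QA_DIR, OUT_DIR and DRY_RUN are hypothetical names used only here.
RAW_QA_DIR="${RAW_QA_DIR:-/path/to/raw_qa}"
OUT_DIR="${OUT_DIR:-data/gt_qa_files}"

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

convert_all() {
    # Each pair is "<converter name>:<raw QA file>".
    for pair in \
        msrvtt:MSRVTT_QA.csv \
        msvd:MSVD_QA.csv \
        tgif:TGIF_FrameQA.csv \
        activitynet:Activitynet_QA.csv
    do
        name="${pair%%:*}"   # text before the first colon
        file="${pair#*:}"    # text after the first colon
        run python "scripts/data/build_${name}_qa.py" \
            --qa_file "$RAW_QA_DIR/$file" \
            --output_root "$OUT_DIR" || return 1
    done
}

convert_all
```

Set `DRY_RUN=0` to execute the converters instead of printing them. The loop stops at the first converter that fails.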
GPT-based evaluation uses OpenRouter through the OpenAI-compatible API. Set the key in your environment:
export OPENROUTER_API_KEY=your_openrouter_api_key
export OPENROUTER_MODEL=openai/gpt-3.5-turbo
export OPENROUTER_APP_NAME=Free2Frame
Inference scripts share this argument order:
aggregation_method num_frames num_sampled_tokens prompt_version image_aspect_ratio
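Because the arguments are positional, naming them once makes a run easier to read and to change. The following is a rough sketch rather than anything the repository ships: the upper-case variable names and `infer_args` are hypothetical, the defaults are the values used in this README's examples, and treating the two numeric arguments as whole numbers is an assumption.

```shell
# Sketch: name the five positional arguments, then emit them in order.
AGGREGATION_METHOD="${AGGREGATION_METHOD:-N2}"
NUM_FRAMES="${NUM_FRAMES:-50}"
NUM_SAMPLED_TOKENS="${NUM_SAMPLED_TOKENS:-2880}"
PROMPT_VERSION="${PROMPT_VERSION:-v3}"
IMAGE_ASPECT_RATIO="${IMAGE_ASPECT_RATIO:-resize}"

infer_args() {
    # num_frames and num_sampled_tokens are assumed to be whole numbers.
    for value in "$NUM_FRAMES" "$NUM_SAMPLED_TOKENS"; do
        case "$value" in
            ''|*[!0-9]*) echo "not an integer: $value" >&2; return 1 ;;
        esac
    done
    echo "$AGGREGATION_METHOD $NUM_FRAMES $NUM_SAMPLED_TOKENS $PROMPT_VERSION $IMAGE_ASPECT_RATIO"
}

# Print the resulting command; remove the leading `echo` to execute it.
args=$(infer_args) && echo bash scripts/infer_video/run_qa_msrvtt.sh $args
```

Overriding a single setting then looks like `NUM_FRAMES=32` before the call, with the remaining arguments keeping their positions.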
Examples:
CUDA_VISIBLE_DEVICES=0 bash scripts/infer_video/run_qa_msrvtt.sh N2 50 2880 v3 resize
CUDA_VISIBLE_DEVICES=0 bash scripts/infer_video/run_qa_msvd.sh N2 50 2880 v3 resize
CUDA_VISIBLE_DEVICES=0 bash scripts/infer_video/run_mvbench.sh N2 50 2880 v3 resize video
Outputs are saved under outputs/ and merged into merge.jsonl.
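A quick way to confirm that a run finished is to count the records in each merged file. This is a small sketch under two assumptions: that merge.jsonl holds one JSON object per line, and that it sits somewhere below the output root (the exact subdirectory is not assumed). `count_merged` is a hypothetical helper name.

```shell
# Sketch: list every merge.jsonl under the output root with its record count.
count_merged() {
    root="${1:-${OUTPUT_ROOT:-outputs}}"
    find "$root" -type f -name merge.jsonl 2>/dev/null | sort |
    while IFS= read -r f; do
        # Count non-empty lines so trailing blank lines are ignored.
        n=$(grep -c . "$f" || true)
        echo "$n $f"
    done
}

count_merged
```

Comparing each count against the number of questions in the matching QA file is one way to spot a run that stopped early.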
Use the unified evaluation entrypoint:
bash scripts/eval/evaluate.sh msrvtt N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh msvd N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh tgif N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh anet N2 50 2880 v3 resize
For generative QA, run inference for each split first, then evaluate all splits together:
bash scripts/infer_video/run_gen_qa_consistency.sh N2 50 2880 v3 resize
bash scripts/infer_video/run_gen_qa_generic.sh N2 50 2880 v3 resize
bash scripts/infer_video/run_gen_qa_temporal.sh N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh gen_qa N2 50 2880 v3 resize
@INPROCEEDINGS{11462196,
author={Lang, Shiqiang and Sun, Peiwen and Jiang, Hao and Zhu, Shuyuan and Zhao, Huiying and Yang, Lan and Zhang, Honggang},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Free2frame: A Training-Free Framework for Video Understanding with Memory Boosting},
year={2026},
volume={},
number={},
pages={10672-10676},
keywords={Memory modules;Filtering;Filters;Printed circuits;Band-pass filters;Filter banks;Videos;Location awareness;Communication systems;LoRa;Video Understanding;Training-Free;Video Content Analysis;Multimodal Large Language Models},
doi={10.1109/ICASSP55912.2026.11462196}
}