Skip to content

ShiqiangLang/Free2Frame

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Free2Frame: A Training-Free Framework for Video Understanding with Memory Boosting

Free2Frame is a training-free video understanding framework for video question answering. It selects and organizes informative frames to support multimodal large language models, enabling efficient inference and evaluation across multiple video QA benchmarks without additional model training.

Free2Frame pipeline

Installation

conda create -n free2frame python=3.10 -y
conda activate free2frame
pip install -r requirements.txt

Model and Data Layout

Run commands from the project root. By default, scripts look for models, data, and outputs under these paths:

Free2Frame/
  checkpoints/
    llava-v1.6-7b/
    clip-vit-base-patch32/
  data/
    gt_qa_files/
      MSRVTT_Zero_Shot_QA/{val_q.json,val_a.json}
      MSVD_Zero_Shot_QA/{val_q.json,val_a.json}
      TGIF_Zero_Shot_QA/{test_q.json,test_a.json}
      Activitynet_Zero_Shot_QA/{test_q.json,test_a.json}
      EgoSchema/val_qa.json
      IntentQA/val_qa.json
      NExTQA/val_qa.json
      MLVU/test_multi_choice_tasks.json
      VCGBench/
    videos/
      MSRVTT-QA/videos/
      MSVD-QA/videos/
      TGIF_Zero_Shot_QA/all_test/
      Activitynet_Zero_Shot_QA/all_test/
      egoschema/videos/
      intentqa/videos/
      nextqa/NExTVideo/
      MLVU_Test/video/
    MVBench/
      json/counterfactual_inference.json
      video/clevrer/video_validation/
  outputs/

You can override paths without editing scripts:

export MODEL_PATH=/path/to/llava-v1.6-7b
export CLIP_MODEL_PATH=/path/to/clip-vit-base-patch32
export DATA_DIR=/path/to/data
export GT_QA_DIR=/path/to/gt_qa_files
export VIDEO_DIR=/path/to/videos
export OUTPUT_ROOT=/path/to/outputs

Dataset-specific overrides are also available, for example:

export MSRVTT_VIDEO_DIR=/path/to/MSRVTT-QA/videos
export MSVD_VIDEO_DIR=/path/to/MSVD-QA/videos
export MVBENCH_VIDEO_DIR=/path/to/MVBench/video/clevrer/video_validation
export MVBENCH_QA_FILE=/path/to/MVBench/json/counterfactual_inference.json

Data Preparation

Dataset converters are available in scripts/data:

python scripts/data/build_msrvtt_qa.py --qa_file /path/to/MSRVTT_QA.csv
python scripts/data/build_msvd_qa.py --qa_file /path/to/MSVD_QA.csv
python scripts/data/build_tgif_qa.py --qa_file /path/to/TGIF_FrameQA.csv
python scripts/data/build_activitynet_qa.py --qa_file /path/to/Activitynet_QA.csv
python scripts/data/build_egoschema_qa.py --qa_file /path/to/EgoSchema.csv
python scripts/data/build_intentqa_qa.py --qa_file /path/to/IntentQA.csv
python scripts/data/build_nextqa_qa.py --qa_file /path/to/NExT_QA.csv
python scripts/data/build_vcgbench_qa.py --qa_folder /path/to/text_generation_benchmark

By default, converted files are written to data/gt_qa_files. Use --output_root to write them elsewhere.

OpenRouter Evaluation API

GPT-based evaluation uses OpenRouter through the OpenAI-compatible API. Set the key in your environment:

export OPENROUTER_API_KEY=your_openrouter_api_key
export OPENROUTER_MODEL=openai/gpt-3.5-turbo
export OPENROUTER_APP_NAME=Free2Frame

Inference

Inference scripts share this argument order:

aggregation_method num_frames num_sampled_tokens prompt_version image_aspect_ratio

Examples:

CUDA_VISIBLE_DEVICES=0 bash scripts/infer_video/run_qa_msrvtt.sh N2 50 2880 v3 resize
CUDA_VISIBLE_DEVICES=0 bash scripts/infer_video/run_qa_msvd.sh N2 50 2880 v3 resize
CUDA_VISIBLE_DEVICES=0 bash scripts/infer_video/run_mvbench.sh N2 50 2880 v3 resize video

Outputs are saved under outputs/ and merged into merge.jsonl.

Evaluation

Use the unified evaluation entrypoint:

bash scripts/eval/evaluate.sh msrvtt N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh msvd N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh tgif N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh anet N2 50 2880 v3 resize

For generative QA, run inference for each split first, then evaluate all splits together:

bash scripts/infer_video/run_gen_qa_consistency.sh N2 50 2880 v3 resize
bash scripts/infer_video/run_gen_qa_generic.sh N2 50 2880 v3 resize
bash scripts/infer_video/run_gen_qa_temporal.sh N2 50 2880 v3 resize
bash scripts/eval/evaluate.sh gen_qa N2 50 2880 v3 resize

Citation

@INPROCEEDINGS{11462196,
  author={Lang, Shiqiang and Sun, Peiwen and Jiang, Hao and Zhu, Shuyuan and Zhao, Huiying and Yang, Lan and Zhang, Honggang},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Free2frame: A Training-Free Framework for Video Understanding with Memory Boosting},
  year={2026},
  volume={},
  number={},
  pages={10672-10676},
  keywords={Memory modules;Filtering;Filters;Printed circuits;Band-pass filters;Filter banks;Videos;Location awareness;Communication systems;LoRa;Video Understanding;Training-Free;Video Content Analysis;Multimodal Large Language Models},
  doi={10.1109/ICASSP55912.2026.11462196}
}

About

The official repo for Free2Frame: A Training-Free Framework for Video Understanding with Memory Boosting.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors