We introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. Specifically, VideoTree dynamically extracts query-related information from the input video and builds a tree-based video representation for LLM reasoning.
Install environment.
Python 3.8 or above is required.
git clone https://github.com/Ziyang412/VideoTree.git
cd VideoTree
python3 -m venv videetree_env
source videetree_env/bin/activate
pip install openai
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pandas
pip install transformers==4.28.1
pip install accelerate
Download dataset annotations and extracted captions.
Download data.zip from the file provided by LLoVi.
unzip data.zip
After unzipping, the extracted EgoSchema captions are available under ./data, which also contains the dataset annotations.
The captions were extracted at 1 FPS with the LaViLa base model.
Download EgoSchema Videos.
Please follow EgoSchema to download the original EgoSchema videos. After downloading, extract each video into frames at 1 FPS (saved as images for faster loading), using the path format ./data/egoschema_frames/{video_id}/{frame_id}.jpg. Then, to speed up the tree building process, extract the visual features for each frame using EVA-CLIP-8B and save them to ./data/egoschema_features/{video_id}.pt.
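Once extraction has finished, a quick layout check can catch videos whose frames exist but whose feature file is missing. The following is a minimal sketch that assumes only the two path formats stated above; the function name `summarize_layout` is hypothetical and not part of this repo.

```python
from pathlib import Path
from typing import Dict, List


def summarize_layout(data_root: str) -> Dict[str, List[str]]:
    """Report which video IDs have frames and which of those lack a feature file.

    Hypothetical helper: it only assumes the layout described in this README.
    """
    root = Path(data_root)
    frames_dir = root / "egoschema_frames"
    feats_dir = root / "egoschema_features"

    # One sub-directory per video: egoschema_frames/{video_id}/{frame_id}.jpg
    with_frames: List[str] = []
    if frames_dir.is_dir():
        with_frames = sorted(
            d.name for d in frames_dir.iterdir()
            if d.is_dir() and any(d.glob("*.jpg"))
        )

    # One feature file per video: egoschema_features/{video_id}.pt
    missing = [v for v in with_frames if not (feats_dir / f"{v}.pt").is_file()]
    return {"videos_with_frames": with_frames, "missing_features": missing}


if __name__ == "__main__":
    report = summarize_layout("./data")
    print(f"{len(report['videos_with_frames'])} videos with frames")
    print(f"{len(report['missing_features'])} videos missing a feature file")
```

The check only inspects file names, so it does not verify that a .pt file actually loads or has the expected shape.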
python data_extraction/extract_images.py
python data_extraction/extract_features.py
Update Kmeans-pytorch.
Since the original Kmeans-pytorch package doesn't set an iteration limit and can loop indefinitely, we updated the init file of the original kmeans-pytorch package.
git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch
Please replace the init file in the "kmeans_pytorch" folder with the file we provide in the "./kmeans_pytorch" folder of this repo, then run the following command.
pip install --editable .
Due to time constraints, we are still updating the codebase. We will also incorporate the scripts/captions for NeXT-QA and IntentQA in the future.
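To illustrate why an iteration limit matters for k-means, here is a minimal pure-Python sketch of Lloyd's algorithm that stops either when the centers stop moving or when a cap is reached. It is not the kmeans_pytorch API and not the patched init file from this repo; the names `kmeans_capped`, `iter_limit`, and `tol` are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_capped(points: List[Point], k: int, tol: float = 1e-4,
                  iter_limit: int = 100, seed: int = 0
                  ) -> Tuple[List[int], List[List[float]], int]:
    """Illustrative k-means: stop on convergence OR after iter_limit iterations.

    Hypothetical helper, not the library's implementation.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    iterations = 0
    # The bounded range replaces an open-ended "loop until converged".
    for iterations in range(1, iter_limit + 1):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        shift = 0.0
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # an empty cluster keeps its previous center
            new_center = [sum(dim) / len(members) for dim in zip(*members)]
            shift += sq_dist(new_center, centers[c])
            centers[c] = new_center
        if shift < tol:
            break  # converged before hitting the cap
    return labels, centers, iterations


if __name__ == "__main__":
    data = [(0.0,), (0.0,), (100.0,), (100.0,)]
    labels, centers, n_iter = kmeans_capped(data, k=2, iter_limit=10)
    print(labels, centers, n_iter)
```

With the cap in place, a run that never meets the tolerance still returns after `iter_limit` iterations with the best assignment found so far.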
Before you begin: Download the VideoMME dataset (annotations, clips, etc.) into the base directory of this repository, and ensure kmeans_pytorch is set up as shown in the Update Kmeans-pytorch section above.
From the project root:
bash scripts/setup_av_env.sh
This creates venv_av (set VENV_DIR to override), installs torch (CUDA 11.8) and the packages in requirements_av.txt, and downloads Qwen2-VL and Qwen2-Audio. Alternatively, install manually:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements_av.txt
python install_qwen.py
Add OPENAI_API_KEY to .env before running the breadth or QA stages.
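Before launching a long run, it can help to confirm that the key is actually visible. The sketch below reads a simple KEY=VALUE .env file with the standard library only; how this repo itself loads .env is not shown here, so `read_dotenv` and `require_openai_key` are hypothetical helpers rather than the project's loader.

```python
import os
from pathlib import Path
from typing import Dict


def read_dotenv(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values: Dict[str, str] = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def require_openai_key(dotenv_path: str = ".env") -> str:
    """Return the key from the process environment or the .env file, else raise."""
    key = os.environ.get("OPENAI_API_KEY") or read_dotenv(dotenv_path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in the environment or in .env")
    return key


if __name__ == "__main__":
    try:
        require_openai_key()
        print("OPENAI_API_KEY found")
    except RuntimeError as err:
        print(err)
```

Failing fast here avoids discovering a missing key only after the breadth stage has already loaded its features.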
Activate the environment (source venv_av/bin/activate, or source videetree_env/bin/activate if using the existing venv). Then run the three stages in order, with defaults from util.py:
# 1. Breadth expansion
python adaptive_breath_expansion_av.py --dataset videomme --output_base_path output/videomme_av_breath --output_filename breadth_expansion.json --prompt_type av_rel --disable_eval
# 2. Depth expansion
python depth_expansion_av.py --breadth_path output/videomme_av_breath/breadth_expansion.json --output_base_path output/videomme_av_depth --output_filename depth_expansion_res.json
# 3. QA evaluation
python main_qa_av.py --dataset videomme --tree_node_idx output/videomme_av_depth/depth_expansion_res_by_quid.json --output_base_path output/videomme_av_qa --output_filename qa_results.json --prompt_type vmme_av_qa
Override paths via --anno_path, --clip_feat_path, --clip_media_path, etc.
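Each stage reads the file written by the previous one, so the three commands have to agree on paths. The sketch below assembles the same commands from a single output root and prints them without running anything. The function `build_videomme_commands` is hypothetical, and the `_by_quid.json` name for the QA input is taken from the command shown above rather than from the scripts themselves.

```python
import shlex
from pathlib import Path
from typing import List


def build_videomme_commands(out_root: str = "output") -> List[List[str]]:
    """Return argv lists for the breadth, depth, and QA stages (hypothetical helper)."""
    root = Path(out_root)
    breadth_dir = root / "videomme_av_breath"
    depth_dir = root / "videomme_av_depth"
    qa_dir = root / "videomme_av_qa"

    breadth_json = breadth_dir / "breadth_expansion.json"
    depth_name = "depth_expansion_res.json"
    # Assumption: the QA stage reads a "<stem>_by_quid.json" file next to the
    # depth output, as in the README command.
    depth_by_quid = depth_dir / (Path(depth_name).stem + "_by_quid.json")

    return [
        ["python", "adaptive_breath_expansion_av.py", "--dataset", "videomme",
         "--output_base_path", breadth_dir.as_posix(),
         "--output_filename", breadth_json.name,
         "--prompt_type", "av_rel", "--disable_eval"],
        ["python", "depth_expansion_av.py",
         "--breadth_path", breadth_json.as_posix(),
         "--output_base_path", depth_dir.as_posix(),
         "--output_filename", depth_name],
        ["python", "main_qa_av.py", "--dataset", "videomme",
         "--tree_node_idx", depth_by_quid.as_posix(),
         "--output_base_path", qa_dir.as_posix(),
         "--output_filename", "qa_results.json",
         "--prompt_type", "vmme_av_qa"],
    ]


if __name__ == "__main__":
    for argv in build_videomme_commands():
        print(shlex.join(argv))  # shlex.join requires Python 3.8+
```

Changing `out_root` moves all three stages together, which avoids pointing a later stage at a stale file from an earlier run.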
Please update the feature path, the args (in util.py), and the output path before running the code.
sh scripts/breath_expansion.sh
Please update the feature path, the outputs of the previous step (the relevance output path and the first-level cluster information), and the output path before running the code.
python depth_expansion.py
Please update the tree node index file (the output of the previous step), the data files, and the output path before running the code.
sh scripts/egoschema_qa.sh
--save_info: save more information, e.g. token usage, detailed prompts, etc.
--num_examples_to_run: how many examples to run. -1 (default) to run all.
--start_from_scratch: ignore existing output files. Start from scratch.
We thank the developers of LLoVi, LifelongMemory, EVA-CLIP, Kmeans-pytorch and SKlearn Clustering for their public code release. We also thank the authors of VideoAgent for the helpful discussion.
Please cite our paper if you use our models in your work:
@InProceedings{Wang_2025_CVPR,
author = {Wang, Ziyang and Yu, Shoubin and Stengel-Eskin, Elias and Yoon, Jaehong and Cheng, Feng and Bertasius, Gedas and Bansal, Mohit},
title = {VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3272-3283}
}
