Accepted to the MAR Workshop at CVPR 2026.
Paper | Preprocessed Dataset | Project Website
This repository contains the code and runnable scripts for POVQA, a preference-optimized framework for video QA that combines temporal pooling with rationale supervision. The repo is set up for fully reproducible runs of:
- Supervised Fine-Tuning (SFT) with QLoRA on interleaved frames + subtitles
- Direct Preference Optimization (DPO) on SFT-initialized policies
- Cross-method evaluation on ReasonVQA and a 5k stratified subset of TVQA
The scripts are self-contained: they discover adapters on disk, mirror train/eval splits, and summarize outputs into JSON suitable for LaTeX table generation.
- Backbone: Qwen/Qwen2.5-VL-7B-Instruct in 4-bit (QLoRA); forward-compatible with Qwen3.5 VL
- Temporal evidence: 4 pooling strategies
  - blend_blur_with_last_frame (BBLF)
  - weighted_average (WA)
  - weighted_average_exponential (WAE)
  - weighted_average_ramp (WAR)
- Context shaping (ReasonVQA): up to 59 frames + 1 keyframe (+hint) at eval; 16 frames at train; interleaved with nearest subtitles
- Projector adaptation: optional LoRA on multimodal projector (--lora_target_mm_projector)
- SFT: trains method-specific LoRA adapters under models/sft-qwen7b-interleaved-16f/<method>
- DPO: initializes policy from SFT adapter, uses frozen reference (base + SFT adapter), outputs under models/dpo-qwen7b-interleaved-16f/<method>
- Evaluation: base model + all adapters over all methods; TVQA val-only stratified 5k (val_5000_seed42.jsonl)
- Outputs: .jsonl generations + .summary.json metrics per run; LaTeX table generators provided
scripts/
├─ run_sft.sh # Train SFT adapters for all methods
├─ run_sft_eval.sh # Evaluate base + SFT adapters across methods
├─ run_dpo.sh # DPO training for all methods (policy init = SFT)
├─ run_dpo_eval.sh # Evaluate DPO adapters across methods
├─ run_tvqa_eval.sh # TVQA val-only, DPO-only, stratified 5k subset
├─ chain_of_thoughts/
│  ├─ generate_synthetic_movies.py # ReasonVQA generator + evaluator
│  └─ generate_synthetic_tvqa.py # TVQA generator + evaluator
├─ preprocessing/
│  ├─ video_preprocessing.py # ReasonVQA preprocessing utilities
│  └─ tvqa_processing.py # TVQA processing utilities
├─ train/
│  ├─ sft_train.py # SFT entrypoint (QLoRA)
│  └─ dpo_train.py # DPO entrypoint (policy + frozen ref)
└─ visualize/
   ├─ generate_latex_table_from_metrics.py
   ├─ generate_latex_ablation.py
   ├─ generate_latex_delta.py
   ├─ qualitative.py
   └─ qualitative_tvqa.py
- ReasonVQA (in-house)
  - annotations/: JSON/JSONL annotations (Q/A + reasoning, timestamps)
  - out_preprocessed/<movie>/<method>/frame_*.png: 59 frames per clip + keyframe
  - Subtitles aligned at sentence/phrase granularity (nearest to frames)
- TVQA (public)
  - input_data/
    - frames_hq/ (pre-extracted frame folders)
    - tvqa_subtitles/ (ASR/subtitles per clip)
    - tvqa_qa_release/ (contains tvqa_val.jsonl)
  - processed_tvqa/ (created by run_tvqa_eval.sh):
    - val_5000_seed42.jsonl (+ .stats.json, .qids.txt) after stratified sampling
Note on splits: For ReasonVQA training, a movie-level split is created via --split_ratio 0.9 with --seed 42, mirrored in eval to avoid leakage.
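The movie-level split described above can be sketched as follows. This is a minimal illustration and not the repository's code: the helper name split_movies and the shuffle-then-cut strategy are assumptions; only the 0.9 ratio and seed 42 come from the note. Splitting on movie IDs rather than on individual questions is what keeps clips from one movie out of both sides.

```python
import random


def split_movies(movie_ids, split_ratio=0.9, seed=42):
    """Split unique movie IDs into (train, held_out) sets.

    Hypothetical helper: the real scripts may order or cut differently.
    """
    # Sort first so the result does not depend on the input order.
    movies = sorted(set(movie_ids))
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(movies)
    cut = int(len(movies) * split_ratio)
    return set(movies[:cut]), set(movies[cut:])


if __name__ == "__main__":
    example_movies = [f"movie_{i:02d}" for i in range(20)]
    train, held_out = split_movies(example_movies)
    print(len(train), "train movies;", len(held_out), "held-out movies")
```

Calling the same function with the same ratio and seed in both the training and evaluation scripts reproduces the identical partition, which is one way to mirror the split.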
We provide the released preprocessed ReasonVQA artifacts on Hugging Face:
https://huggingface.co/datasets/ashimdahal/povqa-preprocessed
The Hugging Face dataset stores each movie's frame directories as .tar archives to keep the repo upload-friendly. After download, extract those archives back into the out_preprocessed/ structure expected by the training and evaluation scripts.
DATASET_ID="ashimdahal/povqa-preprocessed"
DOWNLOAD_DIR="hf_downloads/povqa-preprocessed"
huggingface-cli download "$DATASET_ID" \
  --repo-type dataset \
  --local-dir "$DOWNLOAD_DIR"
mkdir -p out_preprocessed
rsync -a \
  --exclude README.md \
  --exclude manifest.json \
  "$DOWNLOAD_DIR"/ out_preprocessed/
Then extract all archived frame folders in place:
find out_preprocessed -mindepth 2 -maxdepth 2 -type f -name '*.tar' -print0 | \
  while IFS= read -r -d '' archive; do
    tar -xf "$archive" -C "$(dirname "$archive")"
  done
After extraction, each movie directory will again contain folders such as:
KEY_FRAMES/
blend_blur_with_last_frame/
weighted_average/
weighted_average_exponential/
weighted_average_ramp/
alongside the metadata_text_centric*.json files and run_summary.json.
If you want to save disk space after extraction, you can remove the downloaded tar archives:
find out_preprocessed -mindepth 2 -maxdepth 2 -type f -name '*.tar' -delete
At that point, out_preprocessed/ is in the format expected by the rest of this repository for training and evaluation.
We evaluate 4 frame pooling methods:
- BBLF: blend a temporal blur with the last frame (key pose anchoring).
- WA: uniform weighted average of frames.
- WAE: exponential decay weights (recent frames emphasized).
- WAR: linear ramp weights (recent frames emphasized).
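To make the four weighting schemes concrete, here is a small pure-Python sketch. It is an illustration under assumptions, not the preprocessing code in this repo: frames are flat lists of pixel values ordered oldest to newest, the decay factor 0.8 and blend factor alpha=0.5 are made-up defaults, and the "temporal blur" in BBLF is approximated by a plain mean over the window.

```python
def uniform_weights(n):
    """WA: every frame contributes equally."""
    return [1.0 / n] * n


def exponential_weights(n, decay=0.8):
    """WAE: weight shrinks geometrically with distance from the newest frame."""
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def ramp_weights(n):
    """WAR: weight grows linearly from the oldest to the newest frame."""
    raw = [float(i + 1) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def pool(frames, weights):
    """Weighted per-pixel average of equally sized frames."""
    n_pixels = len(frames[0])
    return [
        sum(w * frame[p] for w, frame in zip(weights, frames))
        for p in range(n_pixels)
    ]


def blend_blur_with_last_frame(frames, alpha=0.5):
    """BBLF: mix a blurred summary of the window with the sharp last frame."""
    blurred = pool(frames, uniform_weights(len(frames)))  # assumed blur: mean
    last = frames[-1]
    return [alpha * l + (1.0 - alpha) * b for l, b in zip(last, blurred)]


if __name__ == "__main__":
    window = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]  # three 2-pixel frames
    print("WA  ", pool(window, uniform_weights(3)))
    print("WAE ", pool(window, exponential_weights(3)))
    print("WAR ", pool(window, ramp_weights(3)))
    print("BBLF", blend_blur_with_last_frame(window))
```

In all three weighted variants the weights sum to 1, so the pooled image stays in the same intensity range as the inputs.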
Key design choices
- Interleaving (--interleave): we insert each selected frame followed by its nearest subtitle snippet, which improves temporal coherence for LVLMs.
- Keyframe: we append the paused/annotated keyframe (--append_keyframe) and insert a textual hint (--keyframe_hint) that grounds the question in the user’s paused moment.
- Projector LoRA: enabling --lora_target_mm_projector adapts the visual projector jointly with the language adapter.
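The interleaving and keyframe choices above can be pictured with the sketch below. Everything in it is hypothetical: the dictionary layout, the function names, the use of subtitle midpoints to define "nearest", and placing the hint just before the keyframe are assumptions made for illustration, not the format the training scripts actually build.

```python
def nearest_subtitle(frame_time, subtitles):
    """Return the subtitle whose midpoint is closest to frame_time (seconds)."""
    return min(
        subtitles,
        key=lambda s: abs((s["start"] + s["end"]) / 2.0 - frame_time),
    )


def build_interleaved_context(frame_times, subtitles, keyframe_time=None, hint=None):
    """Build an ordered list of image/text items: frame, subtitle, frame, ..."""
    context = []
    for t in frame_times:
        context.append({"type": "image", "time": t})
        if subtitles:
            snippet = nearest_subtitle(t, subtitles)
            context.append({"type": "text", "text": snippet["text"]})
    if keyframe_time is not None:
        if hint:
            # Assumed placement: the hint introduces the keyframe that follows.
            context.append({"type": "text", "text": hint})
        context.append({"type": "image", "time": keyframe_time, "keyframe": True})
    return context


if __name__ == "__main__":
    subs = [
        {"start": 0.0, "end": 2.0, "text": "Where did he go?"},
        {"start": 4.0, "end": 6.0, "text": "He left an hour ago."},
    ]
    items = build_interleaved_context(
        [1.0, 5.0], subs, keyframe_time=5.5,
        hint="The next image is the frame where the viewer paused.",
    )
    for item in items:
        print(item)
```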
- Python 3.10; PyTorch w/ CUDA; HF transformers/peft/accelerate stack
- QLoRA (4-bit) to fit Qwen2.5-VL-7B-Instruct on a single high-VRAM GPU
- Typical training runs used BF16, gradient checkpointing, and grad accumulation to manage memory
- For evaluation speed, you may set MAX_NEW_TOKENS lower (e.g., 128–512) for TVQA
- Global seed: --seed 42
- ReasonVQA train/eval split mirrored across SFT/DPO and their eval scripts
- TVQA: deterministic 5k stratified subset by show_name (seed=42) with allocation stats saved to disk
- ReasonVQA: annotations/ and out_preprocessed/… exist and match the four method names.
- TVQA:
  - input_data/tvqa_qa_release/tvqa_val.jsonl exists
  - (Optional) frames_hq/ + tvqa_subtitles/ for richer context
./scripts/run_sft.sh
- Produces: models/sft-qwen7b-interleaved-16f/<method>/
- Key flags inside: MAX_FRAMES=16, MAX_SEGMENTS=2048 (subtitle cap), --interleave, --append_keyframe, --keyframe_hint, --lora_target_mm_projector, --bf16, --gradient_checkpointing
- Implementation note: we call Python as a module (python -m scripts.train.sft_train) to make relative imports robust.
./scripts/run_sft_eval.sh
- Base model (no adapter) + every SFT adapter over all four methods
- Eval context: 59 frames + keyframe (+hint), --use_4bit, --interleave
- Outputs:
  - Base: runs/base-qwen7b_59f_plus_keyframe/<method>/*.jsonl (.summary.json)
  - SFT: runs/sft-qwen7b-interleaved-16f_59f_plus_keyframe/<adapter>/<method>/*.jsonl (.summary.json)
./scripts/run_dpo.sh
- Produces: models/dpo-qwen7b-interleaved-16f/<method>/
- Reference model: frozen base + SFT adapter (method-matched)
  - --ref_model_name_or_path = base; --ref_peft_adapter = SFT adapter dir
- Policy init: --sft_adapter points to the same SFT adapter for warm-start
- Key DPO settings: BETA=0.3, LEARNING_RATE=5e-6, NUM_EPOCHS=1, correctness-only negatives (--correctness_only)
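For readers unfamiliar with the role of BETA and the frozen reference, the sketch below computes the standard DPO loss for one preference pair. It follows the published DPO objective and is not a transcription of dpo_train.py: the function name and the example log-probabilities are illustrative, the inputs are assumed to be summed token log-probabilities of the chosen and rejected answers, and options such as --length_normalize may change what the real script feeds in.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """Standard DPO loss for a single (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]))
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # policy vs reference on chosen
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # policy vs reference on rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # At initialization the policy equals the reference, so the margin is 0.
    print("policy == reference:", dpo_pair_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has moved toward the chosen answer and away from the rejected one.
    print("policy improved:    ", dpo_pair_loss(-4.0, -8.0, -5.0, -7.0))
```

Because the policy is initialized from the same SFT adapter that defines the reference, the first case (loss of log 2) is roughly where training is expected to start; a larger beta penalizes drifting from the reference more sharply for the same log-probability gap.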
./scripts/run_dpo_eval.sh
- Discovers adapters dynamically under models/dpo-qwen7b-interleaved-16f/
- Eval context: 59 frames + keyframe (+hint), --use_4bit, --interleave
- Outputs: runs/dpo-qwen7b-interleaved-16f_59f_plus_keyframe/<adapter>/<method>/*.jsonl (.summary.json)
./scripts/run_tvqa_eval.sh
- Step [0]: builds processed_tvqa/val_5000_seed42.jsonl with stratified sampling by show, plus stats
- Step [1]: evaluates DPO adapters only over four methods (unified eval context)
- Step [2]: prints a tiny accuracy grid (train × eval method) for quick sanity check
- Outputs: runs/tvqa_dpo_val5k_59f_uniform/<adapter>/<method>/*.jsonl (.summary.json)
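Step [0] can be approximated by the sketch below, which draws a deterministic subset with per-show quotas proportional to each show's share of the data. The allocation rule (largest remainder, ties broken by show name) and the function name are assumptions; the script's actual rule is not documented here, so treat this as one reasonable way to do it. Only the show_name key, the 5k target, and seed 42 come from this README.

```python
import random
from collections import defaultdict


def stratified_sample(records, n_total, key="show_name", seed=42):
    """Sample n_total records with per-group quotas proportional to group size.

    Returns (sample, allocation) where allocation maps group -> quota.
    """
    if n_total > len(records):
        raise ValueError("n_total exceeds the number of records")
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    total = len(records)
    names = sorted(groups)  # fixed order keeps the draw reproducible
    alloc, remainder = {}, {}
    for name in names:
        # Integer arithmetic: quota is the floor of the proportional share.
        alloc[name], remainder[name] = divmod(n_total * len(groups[name]), total)

    # Hand leftover slots to the groups with the largest remainders.
    leftover = n_total - sum(alloc.values())
    for name in sorted(names, key=lambda g: (-remainder[g], g))[:leftover]:
        alloc[name] += 1

    rng = random.Random(seed)
    sample = []
    for name in names:
        sample.extend(rng.sample(groups[name], alloc[name]))
    return sample, alloc


if __name__ == "__main__":
    data = [{"qid": i, "show_name": show}
            for i, show in enumerate(["A"] * 60 + ["B"] * 30 + ["C"] * 10)]
    subset, allocation = stratified_sample(data, n_total=10)
    print(allocation, len(subset))
```

Writing the returned allocation and the selected IDs next to the subset file, as the script does with its stats and QID files, makes it easy to confirm that two runs produced the same subset.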
Every generation script writes:
- *.jsonl: one JSON per example with prompts, model output, (optionally) chain-of-thought/rationale, final answers
- *.summary.json: aggregated metrics for the file (EM/F1/BLEU/ROUGE-L/embedding metrics, and Accuracy for TVQA)
LaTeX helpers (under visualize/):
- generate_latex_table_from_metrics.py: loads multiple .summary.json files and produces leaderboards (bold maxima, method buckets, etc.)
- generate_latex_ablation.py, generate_latex_delta.py: ablation-style tables and delta tables
- qualitative.py, qualitative_tvqa.py: produce qualitative panels (model outputs, human refs, options block inline in the figure caption/section, etc.)
Tip: The table generators expect keys like EmbedCos and EmbedCos_Reasoning. The scripts already align to those names.
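If you want to inspect the summaries without going through the LaTeX generators, a short script along these lines can gather one metric across runs. It is a sketch under assumptions: it supposes each summary file is a flat JSON object with the metric as a top-level key, which may not match the actual file layout, and collect_summaries is a hypothetical helper that is not part of this repo.

```python
import json
from pathlib import Path


def collect_summaries(runs_root, metric="EmbedCos"):
    """Return (relative_path, value) pairs for every summary containing metric."""
    root = Path(runs_root)
    rows = []
    for path in sorted(root.rglob("*.summary.json")):
        with path.open(encoding="utf-8") as fh:
            summary = json.load(fh)
        # Assumption: metrics live at the top level of the JSON object.
        if isinstance(summary, dict) and metric in summary:
            rows.append((str(path.relative_to(root)), summary[metric]))
    return rows


if __name__ == "__main__":
    for rel_path, value in collect_summaries("runs"):
        print(f"{value:8.4f}  {rel_path}")
```

Files that lack the requested key are skipped, so the same call works for TVQA runs by passing a different metric name.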
- --interleave: Interleave each selected frame with its nearest subtitle snippet for temporally grounded context.
- --append_keyframe + --keyframe_hint: Append the user’s paused frame and insert a brief textual hint referencing it (helps localize the question + answer).
- --lora_target_mm_projector: Apply LoRA to the multimodal projector in addition to the language layers (improves fusion).
- --length_normalize (DPO): Normalizes sequence-length effects during preference loss.
- --use_4bit: Load the backbone with 4-bit quantization to fit on a single GPU.
- --max_frames vs --max_frames_train: Train with fewer frames (e.g., 16) for efficiency; evaluate with richer context (e.g., 59 + keyframe).
- --max_segments(_train): Cap the number of subtitle fragments (we default to 2048; set higher to allow more text).
- Splits: ReasonVQA --split_ratio 0.9, --seed 42 are used everywhere and mirrored at eval.
- TVQA: The 5k subset is stratified by show_name with deterministic seeding. Allocation stats + selected QIDs are written beside the subset file.
- Relative imports: We always invoke training modules via python -m to avoid import errors.
- It runs but is slow:
  - Reduce MAX_NEW_TOKENS (e.g., 128–512) on eval scripts; keep TEMPERATURE=0 for determinism.
  - Use --limit 100 during smoke tests to validate the pipeline quickly.
- LoRA loading / reference wiring:
  - DPO uses base as ref backbone, and if REF_FROM_SFT=1, we attach the frozen SFT adapter to the reference.
  - Policy starts from the SFT adapter (passed via --sft_adapter) and is updated by DPO.
- Out of memory:
  - Keep --use_4bit, --bf16, and --gradient_checkpointing on; reduce BATCH_SIZE or increase GRAD_ACCUM_STEPS.
- Accidentally interrupted training:
  - Scripts write into method-scoped directories. Relaunching will reuse existing folders; delete or rename a run dir to start clean.
- Cross-method tables (base vs SFT vs DPO) under runs/*_59f_plus_keyframe/…
- TVQA sanity grid (train × eval methods) printed at the end of run_tvqa_eval.sh
- Qualitative figures built by visualize/qualitative*.py that include an options table inside the human reference section for a more compact layout
- ReasonVQA content is curated for research; subtitles are aligned and truncated for fair-use academic evaluation.
- TVQA val subset uses the official release; our pipeline does not modify QA content, only frames/subtitles selection for context shaping.
If you find this repo helpful, please cite the arXiv version for now. This section will be updated once the workshop/proceedings citation is available.
@article{dahal2025povqa,
title = {POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency},
author = {Dahal, Ashim and Ghimire, Ankit and Murad, Saydul Akbar and Rahimi, Nick},
journal = {arXiv preprint arXiv:2510.01009},
year = {2025},
url = {https://arxiv.org/abs/2510.01009}
}
Paper: https://arxiv.org/abs/2510.01009
Project website: https://povqa.github.io
Q: Why 16 frames for training but 59 + keyframe at eval?
A: To keep training efficient while evaluating with richer temporal evidence. Empirically, SFT/DPO generalize to denser eval contexts.
Q: What exactly is optimized in DPO?
A: We optimize final-answer tokens with correctness-only negatives. Rationale text is used as supervision in SFT; DPO focuses the policy on choosing correct final answers.
Q: Where are method-best scores pulled from for tables?
A: visualize/generate_latex_table_from_metrics.py scans runs/sft-* and runs/dpo-* and reports per-method maxima across SFT or DPO, depending on which is higher for that method.
- Confirm annotations/ and out_preprocessed/ exist for ReasonVQA
- Confirm input_data/tvqa_qa_release/tvqa_val.jsonl exists for TVQA
- ./scripts/run_sft.sh → produces SFT adapters
- ./scripts/run_sft_eval.sh → base + SFT results under runs/
- ./scripts/run_dpo.sh → produces DPO adapters (policy)
- ./scripts/run_dpo_eval.sh → DPO results under runs/
- ./scripts/run_tvqa_eval.sh → DPO on TVQA val5k + accuracy grid
- visualize/*.py → LaTeX tables & qualitative figures
For an ablation-first quickstart (e.g., BBLF only), run this one-liner from the repo root to train and evaluate just that method:
METHODS=(blend_blur_with_last_frame) ./scripts/run_sft.sh && \
METHODS=(blend_blur_with_last_frame) ./scripts/run_sft_eval.sh && \
METHODS=(blend_blur_with_last_frame) ./scripts/run_dpo.sh && \
METHODS=(blend_blur_with_last_frame) ./scripts/run_dpo_eval.sh