POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Accepted to the MAR Workshop at CVPR 2026.

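Paper: https://arxiv.org/abs/2510.01009 | Preprocessed Dataset: https://huggingface.co/datasets/ashimdahal/povqa-preprocessed | Project Website: https://povqa.github.io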

This repository contains the code and runnable scripts for POVQA, a preference-optimized framework for video QA that combines temporal pooling with rationale supervision. The repo is set up for fully reproducible runs of:

Supervised Fine-Tuning (SFT) with QLoRA on interleaved frames + subtitles
Direct Preference Optimization (DPO) on SFT-initialized policies
Cross-method evaluation on ReasonVQA and a 5k stratified subset of TVQA

The scripts are self-contained: they discover adapters on disk, mirror train/eval splits, and summarize outputs into JSON suitable for LaTeX table generation.

Overview

Backbone: Qwen/Qwen2.5-VL-7B-Instruct in 4-bit (QLoRA); forward-compatible with Qwen3.5 VL
Temporal evidence: 4 pooling strategies
- blend_blur_with_last_frame (BBLF)
- weighted_average (WA)
- weighted_average_exponential (WAE)
- weighted_average_ramp (WAR)
Context shaping (ReasonVQA): up to 59 frames + 1 keyframe (+hint) at eval; 16 frames at train; interleaved with nearest subtitles
Projector adaptation: optional LoRA on multimodal projector (--lora_target_mm_projector)
SFT: trains method-specific LoRA adapters under models/sft-qwen7b-interleaved-16f/<method>
DPO: initializes policy from SFT adapter, uses frozen reference (base + SFT adapter), outputs under models/dpo-qwen7b-interleaved-16f/<method>
Evaluation: base model + all adapters over all methods; TVQA val-only stratified 5k (val_5000_seed42.jsonl)
Outputs: .jsonl generations + .summary.json metrics per run; LaTeX table generators provided

Repository Layout (relevant to runs)

scripts/
├─ run_sft.sh                 # Train SFT adapters for all methods
├─ run_sft_eval.sh            # Evaluate base + SFT adapters across methods
├─ run_dpo.sh                 # DPO training for all methods (policy init = SFT)
├─ run_dpo_eval.sh            # Evaluate DPO adapters across methods
├─ run_tvqa_eval.sh           # TVQA val-only, DPO-only, stratified 5k subset
│
├─ chain_of_thoughts/
│  ├─ generate_synthetic_movies.py  # ReasonVQA generator + evaluator
│  └─ generate_synthetic_tvqa.py    # TVQA generator + evaluator
│
├─ preprocessing/
│  ├─ video_preprocessing.py        # ReasonVQA preprocessing utilities
│  └─ tvqa_processing.py            # TVQA processing utilities
│
├─ train/
│  ├─ sft_train.py                  # SFT entrypoint (QLoRA)
│  └─ dpo_train.py                  # DPO entrypoint (policy + frozen ref)
│
└─ visualize/
   ├─ generate_latex_table_from_metrics.py
   ├─ generate_latex_ablation.py
   ├─ generate_latex_delta.py
   ├─ qualitative.py
   └─ qualitative_tvqa.py

Data & Expected Folders

ReasonVQA (in-house)
- annotations/ — JSON/JSONL annotations (Q/A + reasoning, timestamps)
- out_preprocessed/<movie>/<method>/frame_*.png — 59 frames per clip + keyframe
- Subtitles aligned at sentence/phrase granularity (nearest to frames)
TVQA (public)
- input_data/
  - frames_hq/ (pre-extracted frame folders)
  - tvqa_subtitles/ (ASR/subtitles per clip)
  - tvqa_qa_release/ (contains tvqa_val.jsonl)
- processed_tvqa/ (created by run_tvqa_eval.sh):
  - val_5000_seed42.jsonl (+ .stats.json, .qids.txt) after stratified sampling

Note on splits: For ReasonVQA training, a movie-level split is created via --split_ratio 0.9 with --seed 42, mirrored in eval to avoid leakage.
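
A movie-level split can be sketched as below. This is a minimal illustration, not the repository's implementation: the `movie` field name and the `split_by_movie` helper are hypothetical, and only the ratio (0.9) and seed (42) come from the note above. Splitting on movies rather than on individual questions is what keeps clips from the same film out of both sides.

```python
import random


def split_by_movie(examples, split_ratio=0.9, seed=42):
    """Split examples so that every movie lands entirely in one side.

    `examples` is a list of dicts with a "movie" key (hypothetical field name).
    """
    # Sort first so the shuffle depends only on the seed, not on input order.
    movies = sorted({ex["movie"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(movies)

    n_train = int(len(movies) * split_ratio)
    train_movies = set(movies[:n_train])

    train = [ex for ex in examples if ex["movie"] in train_movies]
    held_out = [ex for ex in examples if ex["movie"] not in train_movies]
    return train, held_out
```

Calling the same function with the same ratio and seed at eval time reproduces the held-out movies, which is the sense in which the split is "mirrored".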

Downloading the Preprocessed Release from Hugging Face

We provide the released preprocessed ReasonVQA artifacts on Hugging Face:

https://huggingface.co/datasets/ashimdahal/povqa-preprocessed

The Hugging Face dataset stores each movie's frame directories as .tar archives to keep the repo upload-friendly. After download, extract those archives back into the out_preprocessed/ structure expected by the training and evaluation scripts.

DATASET_ID="ashimdahal/povqa-preprocessed"
DOWNLOAD_DIR="hf_downloads/povqa-preprocessed"

huggingface-cli download "$DATASET_ID" \
  --repo-type dataset \
  --local-dir "$DOWNLOAD_DIR"

mkdir -p out_preprocessed
rsync -a \
  --exclude README.md \
  --exclude manifest.json \
  "$DOWNLOAD_DIR"/ out_preprocessed/

Then extract all archived frame folders in place:

find out_preprocessed -mindepth 2 -maxdepth 2 -type f -name '*.tar' -print0 | \
  while IFS= read -r -d '' archive; do
    tar -xf "$archive" -C "$(dirname "$archive")"
  done

After extraction, each movie directory will again contain folders such as:

KEY_FRAMES/
blend_blur_with_last_frame/
weighted_average/
weighted_average_exponential/
weighted_average_ramp/

alongside the metadata_text_centric*.json files and run_summary.json.

If you want to save disk space after extraction, you can remove the downloaded tar archives:

find out_preprocessed -mindepth 2 -maxdepth 2 -type f -name '*.tar' -delete

At that point, out_preprocessed/ is in the format expected by the rest of this repository for training and evaluation.

Temporal Pooling & Context Shaping

We evaluate 4 frame pooling methods:

BBLF: blend a temporal blur with the last frame (key pose anchoring).
WA: uniform weighted average of frames.
WAE: exponential decay weights (recent frames emphasized).
WAR: linear ramp weights (recent frames emphasized).
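
The weighting schemes above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the preprocessing code in this repo: frames are flat lists of pixel values, and the `decay` and `alpha` parameters are hypothetical values chosen only for the example.

```python
def uniform_weights(n):
    """WA: every frame in the window gets the same weight."""
    return [1.0 / n] * n


def exponential_weights(n, decay=0.5):
    """WAE: weight shrinks by `decay` per step back in time (hypothetical rate)."""
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def ramp_weights(n):
    """WAR: weights grow linearly from the oldest frame to the newest."""
    raw = [float(i + 1) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def pool(frames, weights):
    """Weighted per-pixel average; each frame is a flat list of pixel values."""
    n_pixels = len(frames[0])
    return [sum(w * f[p] for w, f in zip(weights, frames)) for p in range(n_pixels)]


def blend_with_last(frames, alpha=0.5):
    """BBLF-style: mix the temporal mean (the blur) with the last frame.

    `alpha` is a hypothetical mixing weight for the last frame.
    """
    blur = pool(frames, uniform_weights(len(frames)))
    return [(1 - alpha) * b + alpha * last for b, last in zip(blur, frames[-1])]
```

All three weight vectors sum to 1, so the pooled frame stays in the same intensity range as its inputs; WAE and WAR differ only in how quickly older frames fade.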

Key design choices

Interleaving (--interleave): each selected frame is followed by its nearest subtitle snippet, which improves temporal coherence for LVLMs.
Keyframe: we append the paused/annotated keyframe (--append_keyframe) and insert a textual hint (--keyframe_hint) that grounds the question in the user’s paused moment.
Projector LoRA: enabling --lora_target_mm_projector adapts the visual projector jointly with the language adapter.
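
The interleaving and keyframe choices above might look roughly like the following. This is a sketch only: the dict fields (`path`, `time`, `start`, `end`, `text`), the midpoint-based notion of "nearest", and the placement of the hint just before the keyframe are all assumptions made for illustration, not details confirmed by the scripts.

```python
def nearest_subtitle(frame_time, subtitles):
    """Return the subtitle whose midpoint is closest to frame_time (seconds)."""
    return min(
        subtitles,
        key=lambda s: abs((s["start"] + s["end"]) / 2 - frame_time),
    )


def interleave(frames, subtitles, keyframe=None, hint=None):
    """Build an image/text sequence: frame, its nearest subtitle, frame, ...

    If a keyframe path is given it is appended at the end, preceded by the
    textual hint when one is provided (ordering is an assumption).
    """
    sequence = []
    for frame in frames:
        sequence.append({"type": "image", "path": frame["path"]})
        if subtitles:
            snippet = nearest_subtitle(frame["time"], subtitles)
            sequence.append({"type": "text", "text": snippet["text"]})
    if keyframe is not None:
        if hint:
            sequence.append({"type": "text", "text": hint})
        sequence.append({"type": "image", "path": keyframe})
    return sequence
```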

Environment & Hardware

Python 3.10; PyTorch w/ CUDA; HF transformers/peft/accelerate stack
QLoRA (4-bit) to fit Qwen2.5-VL-7B-Instruct on a single high-VRAM GPU
Typical training runs used BF16, gradient checkpointing, and grad accumulation to manage memory
For evaluation speed, you may set MAX_NEW_TOKENS lower (e.g., 128–512) for TVQA

Reproducibility Settings

Global seed: --seed 42
ReasonVQA train/eval split mirrored across SFT/DPO and their eval scripts
TVQA: deterministic 5k stratified subset by show_name (seed=42) with allocation stats saved to disk
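
One way to build a deterministic stratified subset is proportional allocation with largest-remainder rounding, sketched below. The rounding rule and the `stratified_subset` helper are assumptions for illustration; the README states only that sampling is stratified by show_name with seed 42, so run_tvqa_eval.sh may allocate differently.

```python
import random
from collections import defaultdict


def stratified_subset(rows, k, key="show_name", seed=42):
    """Pick k rows with per-group quotas proportional to group size.

    Assumes k <= len(rows). Returns (subset, quotas).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    names = sorted(groups)  # fixed order keeps the draw reproducible
    total = len(rows)

    # Floor of each group's proportional share.
    quotas = {n: (len(groups[n]) * k) // total for n in names}

    # Hand leftover slots to the groups with the largest fractional share.
    leftover = k - sum(quotas.values())
    by_fraction = sorted(
        names, key=lambda n: ((len(groups[n]) * k) % total, n), reverse=True
    )
    for n in by_fraction[:leftover]:
        quotas[n] += 1

    rng = random.Random(seed)
    subset = []
    for n in names:
        subset.extend(rng.sample(groups[n], quotas[n]))
    return subset, quotas
```

Writing the returned `quotas` next to the subset file gives the kind of allocation record described above.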

How to Run (Minimal End-to-End)

0) Verify data layout

ReasonVQA:
- annotations/ and out_preprocessed/… exist and match the four method names.
TVQA:
- input_data/tvqa_qa_release/tvqa_val.jsonl exists
- (Optional) frames_hq/ + tvqa_subtitles/ for richer context
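
A quick layout check could be scripted as follows. It is a hypothetical helper, not part of the repo; it assumes the out_preprocessed/<movie>/<method>/ structure described under "Data & Expected Folders" and reports which method folders are missing for each movie.

```python
from pathlib import Path

METHODS = (
    "blend_blur_with_last_frame",
    "weighted_average",
    "weighted_average_exponential",
    "weighted_average_ramp",
)


def missing_method_dirs(root):
    """Map each movie directory under `root` to the method folders it lacks."""
    problems = {}
    for movie_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        missing = [m for m in METHODS if not (movie_dir / m).is_dir()]
        if missing:
            problems[movie_dir.name] = missing
    return problems


if __name__ == "__main__":
    if Path("out_preprocessed").is_dir():
        for movie, missing in missing_method_dirs("out_preprocessed").items():
            print(f"{movie}: missing {', '.join(missing)}")
```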

1) Train SFT adapters (ReasonVQA)

./scripts/run_sft.sh

Produces: models/sft-qwen7b-interleaved-16f/<method>/
Key flags inside:
- MAX_FRAMES=16, MAX_SEGMENTS=2048 (subtitle cap), --interleave, --append_keyframe, --keyframe_hint, --lora_target_mm_projector, --bf16, --gradient_checkpointing
Implementation note: we call Python as a module (python -m scripts.train.sft_train) to make relative imports robust.

2) Evaluate base + SFT adapters (ReasonVQA)

./scripts/run_sft_eval.sh

Base model (no adapter) + every SFT adapter over all four methods
Eval context: 59 frames + keyframe (+hint), --use_4bit, --interleave
Outputs:
- Base: runs/base-qwen7b_59f_plus_keyframe/<method>/*.jsonl(.summary.json)
- SFT: runs/sft-qwen7b-interleaved-16f_59f_plus_keyframe/<adapter>/<method>/*.jsonl(.summary.json)

3) Train DPO policies (ReasonVQA)

./scripts/run_dpo.sh

Produces: models/dpo-qwen7b-interleaved-16f/<method>/
Reference model: frozen base + SFT adapter (method-matched)
- --ref_model_name_or_path = base; --ref_peft_adapter = SFT adapter dir
Policy init: --sft_adapter points to the same SFT adapter for warm-start
Key DPO settings: BETA=0.3, LEARNING_RATE=5e-6, NUM_EPOCHS=1, correctness-only negatives (--correctness_only)
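
For reference, the standard DPO objective for a single preference pair is sketched below in plain Python, with BETA=0.3 as in the settings above. This shows the textbook loss only; it does not reproduce dpo_train.py, and it ignores options such as length normalization. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one, measured against the reference.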

4) Evaluate DPO policies (ReasonVQA)

./scripts/run_dpo_eval.sh

Discovers adapters dynamically under models/dpo-qwen7b-interleaved-16f/
Eval context: 59 frames + keyframe (+hint), --use_4bit, --interleave
Outputs: runs/dpo-qwen7b-interleaved-16f_59f_plus_keyframe/<adapter>/<method>/*.jsonl(.summary.json)
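
Adapter discovery can be approximated as below. This is a sketch, assuming each adapter folder contains the adapter_config.json file that PEFT writes when saving an adapter; the shell script's actual discovery logic may differ, and `discover_adapters` is a hypothetical name.

```python
from pathlib import Path


def discover_adapters(models_root):
    """Return names of subfolders under models_root that look like PEFT adapters."""
    root = Path(models_root)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if (p / "adapter_config.json").is_file()
    )


if __name__ == "__main__":
    for name in discover_adapters("models/dpo-qwen7b-interleaved-16f"):
        print(name)
```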

5) Evaluate DPO on TVQA (val-only 5k stratified)

./scripts/run_tvqa_eval.sh

Step [0]: builds processed_tvqa/val_5000_seed42.jsonl with stratified sampling by show, plus stats
Step [1]: evaluates DPO adapters only over four methods (unified eval context)
Step [2]: prints a tiny accuracy grid (train × eval method) for quick sanity check
Outputs: runs/tvqa_dpo_val5k_59f_uniform/<adapter>/<method>/*.jsonl(.summary.json)
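
If you want to rebuild the train × eval grid from disk rather than from the script's console output, a sketch like the following would do it. It assumes the <adapter>/<method>/*.summary.json layout shown above and a metric key named "Accuracy"; the key name is an assumption, so check one summary file before relying on it. If a folder holds several summary files, the last one in sorted order wins.

```python
import json
from pathlib import Path


def accuracy_grid(run_root, metric="Accuracy"):
    """Collect {adapter: {eval_method: value}} from *.summary.json files."""
    grid = {}
    for summary in sorted(Path(run_root).glob("*/*/*.summary.json")):
        adapter, method = summary.parts[-3], summary.parts[-2]
        value = json.loads(summary.read_text()).get(metric)
        if value is not None:
            grid.setdefault(adapter, {})[method] = value
    return grid


def format_grid(grid):
    """Render the grid as tab-separated text, one row per trained adapter."""
    methods = sorted({m for row in grid.values() for m in row})
    lines = ["\t".join(["train\\eval"] + methods)]
    for adapter in sorted(grid):
        cells = [
            f"{grid[adapter][m]:.3f}" if m in grid[adapter] else "-"
            for m in methods
        ]
        lines.append("\t".join([adapter] + cells))
    return "\n".join(lines)
```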

Metrics, Summaries & Tables

Every generation script writes:

*.jsonl — one JSON per example with prompts, model output, (optionally) chain-of-thought/rationale, final answers
*.summary.json — aggregated metrics for the file (EM/F1/BLEU/ROUGE-L/embedding metrics, and Accuracy for TVQA)
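
To illustrate how a per-example file relates to its summary, here is a minimal sketch that computes EM and token-level F1 from JSONL lines. The field names `prediction` and `answer` and the normalization rule are assumptions for the example; the repo's scripts compute more metrics and may normalize text differently.

```python
import json
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(pred, ref):
    return float(normalize(pred) == normalize(ref))


def token_f1(pred, ref):
    p, r = normalize(pred).split(), normalize(ref).split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def summarize(jsonl_lines, pred_key="prediction", ref_key="answer"):
    """Aggregate EM and F1 over an iterable of JSONL lines."""
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not rows:
        return {"n": 0, "EM": 0.0, "F1": 0.0}
    n = len(rows)
    return {
        "n": n,
        "EM": sum(exact_match(r[pred_key], r[ref_key]) for r in rows) / n,
        "F1": sum(token_f1(r[pred_key], r[ref_key]) for r in rows) / n,
    }
```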

LaTeX helpers (under scripts/visualize/):

generate_latex_table_from_metrics.py: loads multiple .summary.json files and produces leaderboards (bold maxima, method buckets, etc.)
generate_latex_ablation.py, generate_latex_delta.py: produce ablation-style tables and delta tables
qualitative.py, qualitative_tvqa.py: produce qualitative panels (model outputs, human references, and the options block inline in the figure caption/section)

Tip: The table generators expect keys like EmbedCos and EmbedCos_Reasoning. The scripts already align to those names.

Key Flags (What They Do)

--interleave Interleave each selected frame with its nearest subtitle snippet for temporally grounded context.
--append_keyframe + --keyframe_hint Append the user’s paused frame and insert a brief textual hint referencing it (helps localize the question + answer).
--lora_target_mm_projector Apply LoRA to the multimodal projector in addition to the language layers (improves fusion).
--length_normalize (DPO) Normalizes sequence-length effects during preference loss.
--use_4bit Load the backbone with 4-bit quantization to fit on a single GPU.
--max_frames vs --max_frames_train Train with fewer frames (e.g., 16) for efficiency; evaluate with richer context (e.g., 59 + keyframe).
--max_segments(_train) Cap the number of subtitle fragments (we default to 2048; set higher to allow more text).
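
As an illustration of the --max_frames / --max_frames_train trade-off, the sketch below picks evenly spaced frame indices when fewer frames are allowed than are available (for example 16 out of 59). Even spacing is an assumption made for this example; the README does not state how the training scripts choose which frames to keep.

```python
def evenly_spaced_indices(n_available, max_frames):
    """Pick up to max_frames indices spread evenly over range(n_available).

    Keeps the first and last frame when max_frames >= 2.
    """
    if n_available <= 0 or max_frames <= 0:
        return []
    if n_available <= max_frames:
        return list(range(n_available))
    if max_frames == 1:
        return [n_available // 2]
    span = n_available - 1
    slots = max_frames - 1
    # Integer rounding of i * span / slots, avoiding float drift.
    return [(i * span + slots // 2) // slots for i in range(max_frames)]
```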

Repro Notes & Determinism

Splits: ReasonVQA --split_ratio 0.9, --seed 42 are used everywhere and mirrored at eval.
TVQA: The 5k subset is stratified by show_name with deterministic seeding. Allocation stats + selected QIDs are written beside the subset file.
Relative imports: We always invoke training modules via python -m to avoid import errors.

Troubleshooting & Tips

It runs but is slow:
- Reduce MAX_NEW_TOKENS (e.g., 128–512) on eval scripts; keep TEMPERATURE=0 for determinism.
- Use --limit 100 during smoke tests to validate the pipeline quickly.
LoRA loading / reference wiring:
- DPO uses base as ref backbone, and if REF_FROM_SFT=1, we attach the frozen SFT adapter to the reference.
- Policy starts from the SFT adapter (passed via --sft_adapter) and is updated by DPO.
Out of memory:
- Keep --use_4bit, --bf16, and --gradient_checkpointing on; reduce BATCH_SIZE or increase GRAD_ACCUM_STEPS.
Accidentally interrupted training:
- Scripts write into method-scoped directories. Relaunching will reuse existing folders; delete or rename a run dir to start clean.

Results Surfaces (what to look for)

Cross-method tables (base vs SFT vs DPO) under runs/*_59f_plus_keyframe/…
TVQA sanity grid (train × eval methods) printed at the end of run_tvqa_eval.sh
Qualitative figures built by scripts/visualize/qualitative*.py, which place an options table inside the human reference section for a more compact layout

Ethical & Data Notes

ReasonVQA content is curated for research; subtitles are aligned and truncated for fair-use academic evaluation.
TVQA val subset uses the official release; our pipeline does not modify QA content, only frames/subtitles selection for context shaping.

Citation

If you find this repo helpful, please cite the arXiv version for now. We will update this section once the workshop/proceedings citation is available.

@article{dahal2025povqa,
  title   = {POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency},
  author  = {Dahal, Ashim and Ghimire, Ankit and Murad, Saydul Akbar and Rahimi, Nick},
  journal = {arXiv preprint arXiv:2510.01009},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.01009}
}

Paper: https://arxiv.org/abs/2510.01009

Project website: https://povqa.github.io

FAQ

Q: Why 16 frames for training but 59 + keyframe at eval? A: To keep training efficient while evaluating with richer temporal evidence. Empirically, SFT/DPO generalize to denser eval contexts.

Q: What exactly is optimized in DPO? A: We optimize final-answer tokens with correctness-only negatives. Rationale text is used as supervision in SFT; DPO focuses the policy on choosing correct final answers.

Q: Where are method-best scores pulled from for tables? A: scripts/visualize/generate_latex_table_from_metrics.py scans runs/sft-* and runs/dpo-* and, for each method, reports whichever of the SFT and DPO scores is higher.

Reproduction Checklist

Confirm annotations/ and out_preprocessed/ exist for ReasonVQA
Confirm input_data/tvqa_qa_release/tvqa_val.jsonl exists for TVQA
./scripts/run_sft.sh → produces SFT adapters
./scripts/run_sft_eval.sh → base + SFT results under runs/
./scripts/run_dpo.sh → produces DPO adapters (policy)
./scripts/run_dpo_eval.sh → DPO results under runs/
./scripts/run_tvqa_eval.sh → DPO on TVQA val5k + accuracy grid
scripts/visualize/*.py → LaTeX tables & qualitative figures

For an ablation-first quickstart (e.g., BBLF only), the following chained command, run from the repo root, trains and evaluates just that method:

METHODS=(blend_blur_with_last_frame) ./scripts/run_sft.sh && \
METHODS=(blend_blur_with_last_frame) ./scripts/run_sft_eval.sh && \
METHODS=(blend_blur_with_last_frame) ./scripts/run_dpo.sh && \
METHODS=(blend_blur_with_last_frame) ./scripts/run_dpo_eval.sh

Thanks for reviewing!
