A curated list of papers and resources on reinforcement learning, RLHF, and preference optimization for image and video diffusion models.
This repository focuses on RL for diffusion generation, especially image and video generation. It does not track the separate line of work on diffusion models used as policies, planners, or dynamics models in reinforcement learning.
This list is updated weekly, with priority given to recent high-impact arXiv papers, papers accepted at top-tier conferences, and influential open-source resources.
Included:
- RLHF and policy-gradient optimization for image or video diffusion models.
- DPO and preference optimization for text-to-image, image-to-image, text-to-video, image-to-video, and video-to-video generation.
- GRPO, RLOO, PPO-style, and related policy optimization methods for diffusion post-training.
- Stochastic optimal control, surrogate-objective, and reward-corrected regression methods for diffusion or flow post-training.
- Reward models, feedback signals, and benchmarks used to optimize or evaluate image/video diffusion outputs.
- Reward guidance and test-time optimization methods when they directly target image or video diffusion alignment.
Out of scope:
- Diffusion models used as RL policies, planners, or dynamics models.
- General LLM RLHF papers without a direct image/video diffusion application.
- Audio, 3D, robotics, molecule, or text diffusion work, except as brief related context.
Task tags: T2I, I2I, T2V, I2V, V2V, Unified
Method tags: RLHF, Policy Gradient, PPO, GRPO, RLOO, SOC, Adjoint, Surrogate, Likelihood Estimation, Reward-Corrected Regression, Score Matching, DPO, Reward Model, Reward Guidance, Benchmark, Toolkit
Reward Models & Feedback Signals
| Year | Paper | Venue | Task | Method | Resources |
|------|-------|-------|------|--------|-----------|
| 2026 | VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment | arXiv | T2V | Reward Model, Reward Guidance | Paper |
| 2026 | Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling | arXiv | T2I | Reward Model | Paper |
| 2025 | VideoScore2: Think before You Score in Generative Video Evaluation | arXiv | T2V | Reward Model, GRPO | Paper, Project |
| 2025 | Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences | ICCV | T2I | Reward Model, DPO | Paper, CVF, Project |
| 2025 | Improving Video Generation with Human Feedback | NeurIPS | T2V | Reward Model, RLHF | Paper, Project, Code |
| 2024 | VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation | arXiv | Unified | Reward Model | Paper, Code |
| 2024 | VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | arXiv | T2V | Reward Model | Paper |
| 2023 | ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation | NeurIPS | T2I | Reward Model, RLHF | Paper, Code |
| 2023 | Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation | NeurIPS | T2I | Reward Model | Paper, Code |
RL Post-Training Taxonomy
Following the taxonomy used by Reinforce Adjoint Matching (RAM), this section organizes reward-based diffusion and flow post-training by the optimization estimator used to inject reward into the model: policy-gradient rollouts, stochastic optimal control/adjoint methods, surrogate objectives, and reward-corrected regression. Preference-only objectives are kept in the separate DPO section below.
Policy Gradient / SDE Rollout
These methods treat the denoising or flow trajectory as an on-policy rollout and optimize reward through PPO, GRPO, RLOO, or related policy-gradient estimators.
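As a rough illustration of how GRPO-style and RLOO-style estimators turn per-sample rewards into advantages, the sketch below computes both for one group of samples generated from the same prompt. It is a minimal sketch under stated assumptions, not the implementation of any paper listed here: the function names `grpo_advantages` and `rloo_advantages` are hypothetical, the population standard deviation and the `eps` constant are illustrative choices, and individual papers differ in normalization, clipping, and KL handling.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group.

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward-model scores for three samples of one prompt.
    group_rewards = [1.0, 2.0, 3.0]
    print(grpo_advantages(group_rewards))
    print(rloo_advantages(group_rewards))
```

In both cases the advantage would then weight the log-probability (or a likelihood surrogate) of each denoising step in the rollout; how that likelihood is estimated is exactly where the listed methods diverge.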
| Year | Paper | Venue | Task | Method | Resources |
|------|-------|-------|------|--------|-----------|
| 2026 | Flow-OPD: On-Policy Distillation for Flow Matching Models | arXiv | T2I | Policy Gradient, GRPO | Paper |
| 2026 | Embedding-perturbed Exploration Preference Optimization for Flow Models | ICML | T2I | Policy Gradient, GRPO | Paper, ICML |
| 2026 | World-R1: Reinforcing 3D Constraints for Text-to-Video Generation | ICML | T2V | Policy Gradient, GRPO, Reward Model | Paper, ICML |
| 2026 | MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation | arXiv | T2I | Policy Gradient, GRPO | Paper, Code |
| 2026 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | arXiv | T2V | Policy Gradient, GRPO | Paper, Project, Code |
| 2026 | From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space | arXiv | T2I | Policy Gradient, GRPO | Paper |
| 2026 | Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO | ICML | T2I | Policy Gradient, GRPO | Paper, ICML, Code |
| 2026 | DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment | ICLR | T2I | Policy Gradient, GRPO, Reward Model | Paper, OpenReview |
| 2026 | BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models | ICLR | T2I, I2V | Policy Gradient, GRPO | Paper, Project |
| 2026 | Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design | ICML | T2I | Policy Gradient, Likelihood Estimation | Paper, ICML |
| 2025 | Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning | arXiv | I2V | Policy Gradient, GRPO, Reward Model | Paper, Project, Code |
| 2025 | Understanding Sampler Stochasticity in Training Diffusion Models for RLHF | arXiv | T2I | Policy Gradient, SDE Rollout | Paper |
| 2025 | Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models | NeurIPS | T2I | Policy Gradient | Paper, NeurIPS |
| 2025 | MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE | arXiv | T2I | Policy Gradient, GRPO | Paper, Code |
| 2025 | InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO | arXiv | T2V | Policy Gradient, GRPO | Paper, Code |
| 2025 | DanceGRPO: Unleashing GRPO on Visual Generation | arXiv | Unified | Policy Gradient, GRPO | Paper, Project |
| 2025 | Flow-GRPO: Training Flow Matching Models via Online RL | NeurIPS | T2I | Policy Gradient, GRPO | Paper, Project, Code |
| 2025 | DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models | AAAI | T2I | Policy Gradient, Exploration | Paper, AAAI |
| 2025 | Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning | arXiv | T2I | Policy Gradient | Paper |
| 2023 | DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models | NeurIPS | T2I | Policy Gradient, PPO | Paper, Code |
| 2023 | Training Diffusion Models with Reinforcement Learning | ICLR | T2I | Policy Gradient, PPO | Paper, Project, Code |
Stochastic Optimal Control / Adjoint
These methods formulate reward maximization as stochastic optimal control or adjoint matching. They are more explicit about the KL-regularized control problem, but often require reward gradients, adjoint equations, or careful continuous-time estimation.
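For orientation, a generic form of the KL-regularized objective these methods refer to can be written as below. This is a sketch under assumptions: the symbols $p_\theta$ (fine-tuned model), $p_{\mathrm{ref}}$ (pretrained reference), $r$ (reward), and $\beta$ (regularization strength) are notation chosen here for illustration, and individual papers state the problem over full trajectories, in continuous time, or with additional conditions.

$$
\max_{\theta}\; \mathbb{E}_{x \sim p_\theta}\big[r(x)\big] \;-\; \beta\, \mathrm{KL}\big(p_\theta \,\|\, p_{\mathrm{ref}}\big),
\qquad
p^{*}(x) \;\propto\; p_{\mathrm{ref}}(x)\, \exp\!\big(r(x)/\beta\big)
$$

The right-hand expression is the well-known optimum of this objective over all distributions: the reference distribution tilted by the exponentiated reward. Smaller $\beta$ pushes the model further toward high-reward samples, while larger $\beta$ keeps it closer to the pretrained model.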
Surrogate Objective / Likelihood Approximation
These methods avoid direct likelihood or rollout estimation by optimizing a tractable proxy, such as forward-process contrast, flow-matching ELBOs, variational EM, reward-weighted regression, or density-control surrogates.
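To make one of these proxies concrete, the sketch below shows the generic idea behind reward-weighted regression: the usual per-sample training loss is kept, but each sample is reweighted by a softmax over its reward. It is a minimal illustration under assumptions, not the objective of any specific paper in the table: `reward_weights`, `reward_weighted_loss`, and the temperature `beta` are hypothetical names, and the per-sample losses stand in for whatever denoising or flow-matching loss a method actually uses.

```python
import math


def reward_weights(rewards, beta=1.0):
    """Softmax weights proportional to exp(r / beta), computed stably."""
    m = max(rewards)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp((r - m) / beta) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]


def reward_weighted_loss(per_sample_losses, rewards, beta=1.0):
    """Weighted average of per-sample losses; high-reward samples count more."""
    weights = reward_weights(rewards, beta)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


if __name__ == "__main__":
    # Hypothetical per-sample regression losses and reward scores.
    losses = [0.8, 0.5, 0.9]
    rewards = [0.1, 0.7, 0.3]
    print(reward_weights(rewards, beta=0.5))
    print(reward_weighted_loss(losses, rewards, beta=0.5))
```

Because the weights depend only on rewards of already generated samples, no likelihood of the sampling trajectory is needed, which is the property this category emphasizes.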
| Year | Paper | Venue | Task | Method | Resources |
|------|-------|-------|------|--------|-----------|
| 2026 | DiffusionNFT: Online Diffusion Reinforcement with Forward Process | ICLR Oral | T2I | Surrogate, Forward Process | Paper, ICLR, Project |
| 2026 | Diffusion Alignment as Variational Expectation-Maximization | ICLR | T2I | Surrogate, Reward Guidance | Paper, OpenReview, Code |
| 2026 | V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think | arXiv | T2I | Surrogate, GRPO, Likelihood Estimation | Paper |
| 2025 | Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning | NeurIPS | T2I | Surrogate, Density Control | Paper, OpenReview |
| 2025 | Reinforcing Diffusion Models by Direct Group Preference Optimization | arXiv | T2I | Surrogate, Group Preference | Paper, Code |
| 2025 | Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models | arXiv | T2I | Surrogate, Score Matching | Paper, Code |
| 2025 | ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models | arXiv | T2I | Surrogate, Reward Feedback | Paper, Code |
| 2025 | Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization | ICLR | T2I | Surrogate, Reward-Weighted Regression | Paper, Project, OpenReview |
| 2025 | LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment | arXiv | T2V | Surrogate, Human Feedback | Paper, Project, Code |
| 2024 | PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models | CVPR | T2I | Surrogate, Reward Model | Paper, Project |
Reward-Corrected Regression / Score Matching
These methods preserve the pretraining-style regression or score-matching structure while modifying the target with reward information. RAM is the clearest example of this fourth category in the RAM taxonomy.
DPO / Preference Optimization
| Year | Paper | Venue | Task | Method | Resources |
|------|-------|-------|------|--------|-----------|
| 2026 | ViPO: Visual Preference Optimization at Scale | ICLR | T2I, T2V | DPO | Paper, ICLR, OpenReview |
| 2026 | Towards Better Optimization For Listwise Preference in Diffusion Models | ICLR | T2I, I2I | DPO | Paper, OpenReview |
| 2026 | Diffusion Negative Preference Optimization Made Simple | ICLR | T2I | DPO | OpenReview |
| 2026 | alpha-DPO: Robust Preference Alignment for Diffusion Models via alpha Divergence | ICLR | T2I | DPO | OpenReview, ICLR |
| 2026 | Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning | arXiv | T2I | DPO, RLHF | Paper |
| 2025 | Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models | arXiv | T2I | DPO | Paper, Code |
| 2025 | CPO: Condition Preference Optimization for Controllable Image Generation | NeurIPS | T2I, I2I | DPO | Paper, OpenReview, Project |
| 2025 | DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models | arXiv | T2V | DPO | Paper |
| 2025 | D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples | ICML | T2I, I2I | DPO | Paper, Proceedings |
| 2025 | CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation | ICML | T2I | DPO, Reward Guidance | OpenReview, Code |
| 2025 | Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences | ICML | T2I | DPO | OpenReview |
| 2025 | Rethinking DPO-style Diffusion Aligning Frameworks | ICCV Highlight | T2I | DPO | CVF, Code |
| 2025 | Self-Supervised Direct Preference Optimization for Text-to-Image Diffusion Models | NeurIPS | T2I | DPO | NeurIPS, OpenReview |
| 2025 | A Gradient Guidance Perspective on Stepwise Preference Optimization for Diffusion Models | NeurIPS | T2I | DPO, Reward Guidance | NeurIPS, Code |
| 2025 | Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization | NeurIPS | T2I | DPO, Reward Model | Paper, Code |
| 2025 | Boost Your Human Image Generation Model via Direct Preference Optimization | CVPR | T2I | DPO | Paper, CVF |
| 2025 | Personalized Preference Fine-tuning of Diffusion Models | CVPR | T2I | DPO | Paper, CVF |
| 2025 | Discriminator-Free Direct Preference Optimization for Video Diffusion | arXiv | T2V | DPO | Paper |
| 2025 | DSPO: Direct Score Preference Optimization for Diffusion Model Alignment | ICLR | T2I | DPO | OpenReview, Proceedings, Code |
| 2025 | Curriculum Direct Preference Optimization for Diffusion and Consistency Models | CVPR | T2I | DPO | Paper, Code |
| 2025 | Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking | arXiv | T2I | DPO | Paper, OpenReview |
| 2024 | VideoDPO: Omni-Preference Alignment for Video Diffusion Generation | arXiv | T2V | DPO | Paper, Project |
| 2024 | OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization | arXiv | T2V | DPO | Paper, Project |
| 2024 | Margin-aware Preference Optimization for Aligning Diffusion Models without Reference | arXiv | T2I | DPO | Paper, Project |
| 2024 | Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization | arXiv | T2I | DPO | Paper, Code |
| 2024 | A Dense Reward View on Aligning Text-to-Image Diffusion with Preference | ICML | T2I | DPO, Reward Model | Paper, PMLR, Code |
| 2024 | Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model | CVPR | T2I | DPO | Paper, Code |
| 2023 | Diffusion Model Alignment Using Direct Preference Optimization | CVPR | T2I | DPO | Paper, Code |
Reward Guidance / Test-time Optimization
| Year | Paper | Venue | Task | Method | Resources |
|------|-------|-------|------|--------|-----------|
| 2026 | Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning | arXiv | T2V | Reward Guidance | Paper |
| 2026 | Scaling Group Inference for Diverse and High-Quality Generation | ICLR | Unified | Reward Guidance | Paper, Project, OpenReview |
| 2026 | Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation | ICLR | T2I | Reward Guidance | Paper, OpenReview, Code |
| 2025 | Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search | arXiv | T2V | Reward Guidance | Paper |
| 2025 | Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference | arXiv | T2I | Reward Guidance, DPO | Paper |
| 2025 | T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | ICLR | T2V | Reward Guidance | Paper, Project, Code |
| 2024 | Video Diffusion Alignment via Reward Gradients | arXiv | T2V | Reward Guidance | Paper, Project |
| 2024 | T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback | arXiv | T2V | Reward Guidance | Paper, Code |
| 2023 | Aligning Text-to-Image Diffusion Models with Reward Backpropagation | arXiv | T2I | Reward Guidance | Paper, Project |
| 2023 | Directly Fine-Tuning Diffusion Models on Differentiable Rewards | ICLR | T2I | Reward Guidance | Paper |

| Name | Scope | Task | Method | Links |
|------|-------|------|--------|-------|
| Flow-Factory | Unified framework for reinforcement learning in flow-matching models | Unified | DPO, GRPO, Toolkit | Code |
| Flow-GRPO | Official implementation for training flow-matching image generators with online RL | T2I | GRPO, Toolkit | Project, Code |
| DDPO | Official PyTorch implementation of denoising diffusion policy optimization | T2I | RLHF, PPO, Toolkit | Project, Code |
| ImageReward | Reward model, dataset, and reward-feedback learning code for text-to-image alignment | T2I | Reward Model, RLHF, Toolkit | Code |

Contributions are welcome. Please read CONTRIBUTING.md before adding a paper or resource.