Awesome RL for Diffusion

A curated list of papers and resources on reinforcement learning, RLHF, and preference optimization for image and video diffusion models.

This repository focuses on RL for diffusion-based generation, especially of images and videos. It does not track the separate line of work on diffusion models used as policies, planners, or dynamics models in reinforcement learning.

This list is updated weekly. Priority goes to recent high-impact arXiv papers, papers accepted at top-tier conferences, and influential open-source resources.

Scope

Included:

RLHF and policy-gradient optimization for image or video diffusion models.
DPO and preference optimization for text-to-image, image-to-image, text-to-video, image-to-video, and video-to-video generation.
GRPO, RLOO, PPO-style, and related policy optimization methods for diffusion post-training.
Stochastic optimal control, surrogate-objective, and reward-corrected regression methods for diffusion or flow post-training.
Reward models, feedback signals, and benchmarks used to optimize or evaluate image/video diffusion outputs.
Reward guidance and test-time optimization methods when they directly target image or video diffusion alignment.

Out of scope:

Diffusion models used as RL policies, planners, or dynamics models.
General LLM RLHF papers without a direct image/video diffusion application.
Audio, 3D, robotics, molecule, or text diffusion work, except as brief related context.

Tag Guide

Task tags: T2I, I2I, T2V, I2V, V2V, Unified

Method tags: RLHF, Policy Gradient, PPO, GRPO, RLOO, SOC, Adjoint, Surrogate, Likelihood Estimation, Reward-Corrected Regression, Score Matching, DPO, Reward Model, Reward Guidance, Benchmark, Toolkit

Surveys & Overviews

Year	Paper	Venue	Task	Method	Resources
2026	Advances in GRPO for Generation Models: A Survey	arXiv	Unified	GRPO	Paper
2025	Reinforcement Learning for Large Model: A Survey	arXiv	Unified	RLHF, GRPO	Paper, Code

Reward Models & Feedback Signals

Year	Paper	Venue	Task	Method	Resources
2026	VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment	arXiv	T2V	Reward Model, Reward Guidance	Paper
2026	Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling	arXiv	T2I	Reward Model	Paper
2025	VideoScore2: Think before You Score in Generative Video Evaluation	arXiv	T2V	Reward Model, GRPO	Paper, Project
2025	Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences	ICCV	T2I	Reward Model, DPO	Paper, CVF, Project
2025	Improving Video Generation with Human Feedback	NeurIPS	T2V	Reward Model, RLHF	Paper, Project, Code
2024	VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation	arXiv	Unified	Reward Model	Paper, Code
2024	VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation	arXiv	T2V	Reward Model	Paper
2023	ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation	NeurIPS	T2I	Reward Model, RLHF	Paper, Code
2023	Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation	NeurIPS	T2I	Reward Model	Paper, Code

RL Post-Training Taxonomy

Following the taxonomy used by Reinforce Adjoint Matching (RAM), this section organizes reward-based diffusion and flow post-training by the optimization estimator used to inject reward into the model: policy-gradient rollouts, stochastic optimal control/adjoint methods, surrogate objectives, and reward-corrected regression. Preference-only objectives are kept in the separate DPO section below.

Policy Gradient / SDE Rollout

These methods treat the denoising or flow trajectory as an on-policy rollout and optimize reward through PPO, GRPO, RLOO, or related policy-gradient estimators.
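For orientation, the group-relative advantage and clipped ratio objective that GRPO- and PPO-style methods build on can be sketched as follows. This is a minimal illustration rather than any listed paper's implementation: the function names, the `eps` and `clip` defaults, and the use of one scalar log-probability per sample are assumptions made here, and real methods add per-step log-probabilities along the denoising trajectory, KL penalties, and other stabilizers.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, clip=0.2):
    """PPO-style clipped objective, averaged over samples (to be maximized)."""
    terms = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * a, clipped_ratio * a))
    return sum(terms) / len(terms)

# Four samples for one prompt, scored by a hypothetical reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
objective = clipped_surrogate([0.0] * 4, [0.0] * 4, advantages)
```

Because the advantages are standardized within the group, no learned value function is needed; samples are only compared against others generated for the same prompt.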

Year	Paper	Venue	Task	Method	Resources
2026	Flow-OPD: On-Policy Distillation for Flow Matching Models	arXiv	T2I	Policy Gradient, GRPO	Paper
2026	Embedding-perturbed Exploration Preference Optimization for Flow Models	ICML	T2I	Policy Gradient, GRPO	Paper, ICML
2026	World-R1: Reinforcing 3D Constraints for Text-to-Video Generation	ICML	T2V	Policy Gradient, GRPO, Reward Model	Paper, ICML
2026	MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation	arXiv	T2I	Policy Gradient, GRPO	Paper, Code
2026	Manifold-Aware Exploration for Reinforcement Learning in Video Generation	arXiv	T2V	Policy Gradient, GRPO	Paper, Project, Code
2026	From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space	arXiv	T2I	Policy Gradient, GRPO	Paper
2026	Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO	ICML	T2I	Policy Gradient, GRPO	Paper, ICML, Code
2026	DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment	ICLR	T2I	Policy Gradient, GRPO, Reward Model	Paper, OpenReview
2026	BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models	ICLR	T2I, I2V	Policy Gradient, GRPO	Paper, Project
2026	Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design	ICML	T2I	Policy Gradient, Likelihood Estimation	Paper, ICML
2025	Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning	arXiv	I2V	Policy Gradient, GRPO, Reward Model	Paper, Project, Code
2025	Understanding Sampler Stochasticity in Training Diffusion Models for RLHF	arXiv	T2I	Policy Gradient, SDE Rollout	Paper
2025	Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models	NeurIPS	T2I	Policy Gradient	Paper, NeurIPS
2025	MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE	arXiv	T2I	Policy Gradient, GRPO	Paper, Code
2025	InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO	arXiv	T2V	Policy Gradient, GRPO	Paper, Code
2025	DanceGRPO: Unleashing GRPO on Visual Generation	arXiv	Unified	Policy Gradient, GRPO	Paper, Project
2025	Flow-GRPO: Training Flow Matching Models via Online RL	NeurIPS	T2I	Policy Gradient, GRPO	Paper, Project, Code
2025	DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models	AAAI	T2I	Policy Gradient, Exploration	Paper, AAAI
2025	Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning	arXiv	T2I	Policy Gradient	Paper
2023	DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models	NeurIPS	T2I	Policy Gradient, PPO	Paper, Code
2023	Training Diffusion Models with Reinforcement Learning	ICLR	T2I	Policy Gradient, PPO	Paper, Project, Code

Stochastic Optimal Control / Adjoint

These methods formulate reward maximization as stochastic optimal control or adjoint matching. They state the KL-regularized control problem explicitly, but often require reward gradients, adjoint equations, or careful continuous-time estimation.
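As a rough orientation, the KL-regularized objective behind this family can be written in the generic form below, together with its well-known optimum, a reward-tilted version of the base distribution. The symbols (reward $r$, base model $p_{\text{base}}$, regularization strength $\beta$) are notation assumed here for illustration; the listed papers work with path measures, controlled SDEs, and noise-schedule conditions that this sketch omits.

```latex
\max_{p}\; \mathbb{E}_{x \sim p}\,[\,r(x)\,] \;-\; \beta\, \mathrm{KL}\!\left(p \,\|\, p_{\text{base}}\right),
\qquad
p^{*}(x) \;\propto\; p_{\text{base}}(x)\, \exp\!\left(r(x)/\beta\right)
```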

Year	Paper	Venue	Task	Method	Resources
2026	Efficient Adjoint Matching for Fine-tuning Diffusion Models	arXiv	T2I	SOC, Adjoint	Paper
2026	A unified perspective on fine-tuning and sampling with diffusion and flow models	arXiv	Unified	SOC, Adjoint	Paper
2025	Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning	arXiv	T2I	SOC, Policy Gradient	Paper
2025	Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control	ICLR	Unified	SOC, Adjoint	Paper, OpenReview
2024	Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control	ICML	T2I	SOC, Reward Guidance	Paper, PMLR

Surrogate Objective / Likelihood Approximation

These methods avoid direct likelihood or rollout estimation by optimizing a tractable proxy, such as forward-process contrast, flow-matching ELBOs, variational EM, reward-weighted regression, or density-control surrogates.
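One of the proxies named above, reward-weighted regression, can be sketched in a few lines: per-sample training losses are reweighted by exponentiated rewards instead of estimating likelihoods. This is a minimal sketch under assumptions made here (the function names, the `temperature` parameter, and softmax normalization over the batch are illustrative choices), not the procedure of any specific listed paper.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Softmax weights proportional to exp(reward / temperature)."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_regression_loss(per_sample_losses, rewards, temperature=1.0):
    """Reward-weighted average of per-sample denoising or flow-matching losses."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, per_sample_losses))

# Hypothetical per-sample losses and rewards for a batch of three generations.
loss = weighted_regression_loss([0.9, 0.5, 0.7], [0.1, 0.8, 0.4], temperature=0.5)
```

A lower temperature concentrates the weight on the highest-reward samples, while a higher one approaches the unweighted pretraining loss.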

Year	Paper	Venue	Task	Method	Resources
2026	DiffusionNFT: Online Diffusion Reinforcement with Forward Process	ICLR Oral	T2I	Surrogate, Forward Process	Paper, ICLR, Project
2026	Diffusion Alignment as Variational Expectation-Maximization	ICLR	T2I	Surrogate, Reward Guidance	Paper, OpenReview, Code
2026	V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think	arXiv	T2I	Surrogate, GRPO, Likelihood Estimation	Paper
2025	Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning	NeurIPS	T2I	Surrogate, Density Control	Paper, OpenReview
2025	Reinforcing Diffusion Models by Direct Group Preference Optimization	arXiv	T2I	Surrogate, Group Preference	Paper, Code
2025	Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models	arXiv	T2I	Surrogate, Score Matching	Paper, Code
2025	ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models	arXiv	T2I	Surrogate, Reward Feedback	Paper, Code
2025	Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization	ICLR	T2I	Surrogate, Reward-Weighted Regression	Paper, Project, OpenReview
2025	LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment	arXiv	T2V	Surrogate, Human Feedback	Paper, Project, Code
2024	PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models	CVPR	T2I	Surrogate, Reward Model	Paper, Project

Reward-Corrected Regression / Score Matching

These methods preserve the pretraining-style regression or score-matching structure while modifying the target with reward information. Reinforce Adjoint Matching (RAM) is the clearest example of this fourth category in its own taxonomy.

Year	Paper	Venue	Task	Method	Resources
2026	Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models	arXiv	T2I	Reward-Corrected Regression, SOC	Paper
2026	Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models	arXiv	T2I	Reward-Corrected Regression, Score Matching	Paper

DPO / Preference Optimization

Year	Paper	Venue	Task	Method	Resources
2026	ViPO: Visual Preference Optimization at Scale	ICLR	T2I, T2V	DPO	Paper, ICLR, OpenReview
2026	Towards Better Optimization For Listwise Preference in Diffusion Models	ICLR	T2I, I2I	DPO	Paper, OpenReview
2026	Diffusion Negative Preference Optimization Made Simple	ICLR	T2I	DPO	OpenReview
2026	alpha-DPO: Robust Preference Alignment for Diffusion Models via alpha Divergence	ICLR	T2I	DPO	OpenReview, ICLR
2026	Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning	arXiv	T2I	DPO, RLHF	Paper
2025	Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models	arXiv	T2I	DPO	Paper, Code
2025	CPO: Condition Preference Optimization for Controllable Image Generation	NeurIPS	T2I, I2I	DPO	Paper, OpenReview, Project
2025	DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models	arXiv	T2V	DPO	Paper
2025	D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples	ICML	T2I, I2I	DPO	Paper, Proceedings
2025	CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation	ICML	T2I	DPO, Reward Guidance	OpenReview, Code
2025	Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences	ICML	T2I	DPO	OpenReview
2025	Rethinking DPO-style Diffusion Aligning Frameworks	ICCV Highlight	T2I	DPO	CVF, Code
2025	Self-Supervised Direct Preference Optimization for Text-to-Image Diffusion Models	NeurIPS	T2I	DPO	NeurIPS, OpenReview
2025	A Gradient Guidance Perspective on Stepwise Preference Optimization for Diffusion Models	NeurIPS	T2I	DPO, Reward Guidance	NeurIPS, Code
2025	Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization	NeurIPS	T2I	DPO, Reward Model	Paper, Code
2025	Boost Your Human Image Generation Model via Direct Preference Optimization	CVPR	T2I	DPO	Paper, CVF
2025	Personalized Preference Fine-tuning of Diffusion Models	CVPR	T2I	DPO	Paper, CVF
2025	Discriminator-Free Direct Preference Optimization for Video Diffusion	arXiv	T2V	DPO	Paper
2025	DSPO: Direct Score Preference Optimization for Diffusion Model Alignment	ICLR	T2I	DPO	OpenReview, Proceedings, Code
2025	Curriculum Direct Preference Optimization for Diffusion and Consistency Models	CVPR	T2I	DPO	Paper, Code
2025	Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking	arXiv	T2I	DPO	Paper, OpenReview
2024	VideoDPO: Omni-Preference Alignment for Video Diffusion Generation	arXiv	T2V	DPO	Paper, Project
2024	OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization	arXiv	T2V	DPO	Paper, Project
2024	Margin-aware Preference Optimization for Aligning Diffusion Models without Reference	arXiv	T2I	DPO	Paper, Project
2024	Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization	arXiv	T2I	DPO	Paper, Code
2024	A Dense Reward View on Aligning Text-to-Image Diffusion with Preference	ICML	T2I	DPO, Reward Model	Paper, PMLR, Code
2024	Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model	CVPR	T2I	DPO	Paper, Code
2023	Diffusion Model Alignment Using Direct Preference Optimization	CVPR	T2I	DPO	Paper, Code

Reward Guidance / Test-time Optimization

Year	Paper	Venue	Task	Method	Resources
2026	Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning	arXiv	T2V	Reward Guidance	Paper
2026	Scaling Group Inference for Diverse and High-Quality Generation	ICLR	Unified	Reward Guidance	Paper, Project, OpenReview
2026	Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation	ICLR	T2I	Reward Guidance	Paper, OpenReview, Code
2025	Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search	arXiv	T2V	Reward Guidance	Paper
2025	Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference	arXiv	T2I	Reward Guidance, DPO	Paper
2025	T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design	ICLR	T2V	Reward Guidance	Paper, Project, Code
2024	Video Diffusion Alignment via Reward Gradients	arXiv	T2V	Reward Guidance	Paper, Project
2024	T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback	arXiv	T2V	Reward Guidance	Paper, Code
2023	Aligning Text-to-Image Diffusion Models with Reward Backpropagation	arXiv	T2I	Reward Guidance	Paper, Project
2023	Directly Fine-Tuning Diffusion Models on Differentiable Rewards	ICLR	T2I	Reward Guidance	Paper

Benchmarks & Evaluation

Year	Paper	Venue	Task	Method	Resources
2024	VBench: Comprehensive Benchmark Suite for Video Generative Models	CVPR	T2V, I2V	Benchmark	Paper, Project, Code
2023	GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment	NeurIPS	T2I	Benchmark	Paper, Code
2023	Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis	arXiv	T2I	Benchmark, Reward Model	Paper, Code
2023	Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation	NeurIPS	T2I	Benchmark, Reward Model	Paper, Code

Toolkits & Codebases

Name	Scope	Task	Method	Links
Flow-Factory	Unified framework for reinforcement learning in flow-matching models	Unified	DPO, GRPO, Toolkit	Code
Flow-GRPO	Official implementation for training flow-matching image generators with online RL	T2I	GRPO, Toolkit	Project, Code
DDPO	Official PyTorch implementation of denoising diffusion policy optimization	T2I	RLHF, PPO, Toolkit	Project, Code
ImageReward	Reward model, dataset, and reward-feedback learning code for text-to-image alignment	T2I	Reward Model, RLHF, Toolkit	Code

Related Areas

awesome-diffusion-model-in-rl: diffusion models for reinforcement learning, a related but different direction.
Awesome-RLHF-Video-Diffusion: focused list for RLHF in video diffusion.
awesome-diffusion-rl: related list covering RL for diffusion and diffusion-related post-training.

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before adding a paper or resource.
