YuanaHao/Awesome-Diffusion-RL

Awesome RL for Diffusion

A curated list of papers and resources on reinforcement learning, RLHF, and preference optimization for image and video diffusion models.

This repository focuses on RL for diffusion generation, especially image and video generation. It does not track the separate line of work on diffusion models used as policies, planners, or dynamics models in reinforcement learning.

This list is updated weekly. Priority goes to recent high-impact arXiv papers, papers accepted at top-tier conferences, and influential open-source resources.

Scope

Included:

  • RLHF and policy-gradient optimization for image or video diffusion models.
  • DPO and preference optimization for text-to-image, image-to-image, text-to-video, image-to-video, and video-to-video generation.
  • GRPO, RLOO, PPO-style, and related policy optimization methods for diffusion post-training.
  • Stochastic optimal control, surrogate-objective, and reward-corrected regression methods for diffusion or flow post-training.
  • Reward models, feedback signals, and benchmarks used to optimize or evaluate image/video diffusion outputs.
  • Reward guidance and test-time optimization methods when they directly target image or video diffusion alignment.

Out of scope:

  • Diffusion models used as RL policies, planners, or dynamics models.
  • General LLM RLHF papers without a direct image/video diffusion application.
  • Audio, 3D, robotics, molecule, or text diffusion work, except as brief related context.

Tag Guide

Task tags: T2I, I2I, T2V, I2V, V2V, Unified

Method tags: RLHF, Policy Gradient, PPO, GRPO, RLOO, SOC, Adjoint, Surrogate, Likelihood Estimation, Reward-Corrected Regression, Score Matching, DPO, Reward Model, Reward Guidance, Benchmark, Toolkit
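For contributors who want to check an entry against the tag guide before submitting it, a minimal Python sketch is shown below. The function name `unknown_tags` and the comma-separated input format are assumptions made for illustration; they are not part of this repository.

```python
# Hypothetical helper: report tags that are not listed in the tag guide.
TASK_TAGS = {"T2I", "I2I", "T2V", "I2V", "V2V", "Unified"}
METHOD_TAGS = {
    "RLHF", "Policy Gradient", "PPO", "GRPO", "RLOO", "SOC", "Adjoint",
    "Surrogate", "Likelihood Estimation", "Reward-Corrected Regression",
    "Score Matching", "DPO", "Reward Model", "Reward Guidance",
    "Benchmark", "Toolkit",
}


def _split(field):
    """Split a comma-separated tag field into trimmed, non-empty tags."""
    return [tag.strip() for tag in field.split(",") if tag.strip()]


def unknown_tags(task_field, method_field):
    """Return every tag in the two fields that the tag guide does not list."""
    bad_tasks = [t for t in _split(task_field) if t not in TASK_TAGS]
    bad_methods = [m for m in _split(method_field) if m not in METHOD_TAGS]
    return bad_tasks + bad_methods
```

For example, `unknown_tags("T2I", "Policy Gradient, SDE Rollout")` returns `["SDE Rollout"]`, because that label is not in the guide. Such labels may be intended as free-form secondary tags, so a flagged tag is a prompt to double-check rather than an error.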

Surveys & Overviews

Year | Paper | Venue | Task | Method | Resources
2026 | Advances in GRPO for Generation Models: A Survey | arXiv | Unified | GRPO | Paper
2025 | Reinforcement Learning for Large Model: A Survey | arXiv | Unified | RLHF, GRPO | Paper, Code

Reward Models & Feedback Signals

Year | Paper | Venue | Task | Method | Resources
2026 | VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment | arXiv | T2V | Reward Model, Reward Guidance | Paper
2026 | Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling | arXiv | T2I | Reward Model | Paper
2025 | VideoScore2: Think before You Score in Generative Video Evaluation | arXiv | T2V | Reward Model, GRPO | Paper, Project
2025 | Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences | ICCV | T2I | Reward Model, DPO | Paper, CVF, Project
2025 | Improving Video Generation with Human Feedback | NeurIPS | T2V | Reward Model, RLHF | Paper, Project, Code
2024 | VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation | arXiv | Unified | Reward Model | Paper, Code
2024 | VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | arXiv | T2V | Reward Model | Paper
2023 | ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation | NeurIPS | T2I | Reward Model, RLHF | Paper, Code
2023 | Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation | NeurIPS | T2I | Reward Model | Paper, Code

RL Post-Training Taxonomy

Following the taxonomy used by Reinforce Adjoint Matching (RAM), this section organizes reward-based diffusion and flow post-training by the optimization estimator used to inject reward into the model: policy-gradient rollouts, stochastic optimal control/adjoint methods, surrogate objectives, and reward-corrected regression. Preference-only objectives are kept in the separate DPO section below.

Policy Gradient / SDE Rollout

These methods treat the denoising or flow trajectory as an on-policy rollout and optimize reward through PPO, GRPO, RLOO, or related policy-gradient estimators.
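As a rough illustration of the estimators named above, the sketch below computes group-relative (GRPO-style) and leave-one-out (RLOO-style) advantages from the rewards of several samples generated for one prompt, then forms a REINFORCE-style surrogate from per-trajectory log-probabilities. The function names, the use of the population standard deviation, and the `eps` constant are assumptions for illustration. Individual papers differ in clipping, KL terms, and how per-step log-probabilities are obtained from the sampler.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Compare each sample with the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def reinforce_surrogate(log_probs, advantages):
    """Scalar surrogate: minus the mean of advantage-weighted log-probs.

    In an autodiff framework, differentiating this value with respect to the
    model parameters (advantages held constant) gives a policy-gradient
    estimate. Here it is evaluated on plain floats only.
    """
    return -mean([lp * a for lp, a in zip(log_probs, advantages)])
```

With rewards `[1.0, 2.0, 3.0, 4.0]`, the leave-one-out advantage of the first sample is `1.0 - (2.0 + 3.0 + 4.0) / 3 = -2.0`, and the GRPO-style advantages sum to approximately zero.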

Year | Paper | Venue | Task | Method | Resources
2026 | Flow-OPD: On-Policy Distillation for Flow Matching Models | arXiv | T2I | Policy Gradient, GRPO | Paper
2026 | Embedding-perturbed Exploration Preference Optimization for Flow Models | ICML | T2I | Policy Gradient, GRPO | Paper, ICML
2026 | World-R1: Reinforcing 3D Constraints for Text-to-Video Generation | ICML | T2V | Policy Gradient, GRPO, Reward Model | Paper, ICML
2026 | MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation | arXiv | T2I | Policy Gradient, GRPO | Paper, Code
2026 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | arXiv | T2V | Policy Gradient, GRPO | Paper, Project, Code
2026 | From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space | arXiv | T2I | Policy Gradient, GRPO | Paper
2026 | Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO | ICML | T2I | Policy Gradient, GRPO | Paper, ICML, Code
2026 | DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment | ICLR | T2I | Policy Gradient, GRPO, Reward Model | Paper, OpenReview
2026 | BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models | ICLR | T2I, I2V | Policy Gradient, GRPO | Paper, Project
2026 | Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design | ICML | T2I | Policy Gradient, Likelihood Estimation | Paper, ICML
2025 | Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning | arXiv | I2V | Policy Gradient, GRPO, Reward Model | Paper, Project, Code
2025 | Understanding Sampler Stochasticity in Training Diffusion Models for RLHF | arXiv | T2I | Policy Gradient, SDE Rollout | Paper
2025 | Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models | NeurIPS | T2I | Policy Gradient | Paper, NeurIPS
2025 | MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE | arXiv | T2I | Policy Gradient, GRPO | Paper, Code
2025 | InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO | arXiv | T2V | Policy Gradient, GRPO | Paper, Code
2025 | DanceGRPO: Unleashing GRPO on Visual Generation | arXiv | Unified | Policy Gradient, GRPO | Paper, Project
2025 | Flow-GRPO: Training Flow Matching Models via Online RL | NeurIPS | T2I | Policy Gradient, GRPO | Paper, Project, Code
2025 | DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models | AAAI | T2I | Policy Gradient, Exploration | Paper, AAAI
2025 | Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning | arXiv | T2I | Policy Gradient | Paper
2023 | DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models | NeurIPS | T2I | Policy Gradient, PPO | Paper, Code
2023 | Training Diffusion Models with Reinforcement Learning | ICLR | T2I | Policy Gradient, PPO | Paper, Project, Code

Stochastic Optimal Control / Adjoint

These methods formulate reward maximization as stochastic optimal control or adjoint matching. They are more explicit about the KL-regularized control problem, but often require reward gradients, adjoint equations, or careful continuous-time estimation.
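The KL-regularized problem mentioned above can be illustrated on a toy discrete distribution. Given a reference distribution p_ref, a reward r, and a temperature beta, maximizing E_p[r] - beta * KL(p || p_ref) has the closed-form solution p*(x) proportional to p_ref(x) * exp(r(x) / beta), and the optimal value equals beta * log Z, where Z is the normalizer. The sketch below checks this numerically. It is only a finite-state analogy, not the continuous-time SDE formulation or the adjoint computation used in the listed papers, and all function names are hypothetical.

```python
import math


def log_partition(p_ref, rewards, beta):
    """log Z, where Z = sum_x p_ref(x) * exp(r(x) / beta)."""
    return math.log(sum(q * math.exp(r / beta) for q, r in zip(p_ref, rewards)))


def tilted_distribution(p_ref, rewards, beta):
    """Closed-form maximizer: the reference distribution tilted by exp(r / beta)."""
    log_z = log_partition(p_ref, rewards, beta)
    return [q * math.exp(r / beta - log_z) for q, r in zip(p_ref, rewards)]


def kl_regularized_objective(p, p_ref, rewards, beta):
    """Expected reward minus beta times KL(p || p_ref)."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_ref) if pi > 0)
    return expected_reward - beta * kl
```

A smaller `beta` moves more probability mass toward high-reward states, while a larger `beta` keeps the solution closer to the reference distribution.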

Year | Paper | Venue | Task | Method | Resources
2026 | Efficient Adjoint Matching for Fine-tuning Diffusion Models | arXiv | T2I | SOC, Adjoint | Paper
2026 | A unified perspective on fine-tuning and sampling with diffusion and flow models | arXiv | Unified | SOC, Adjoint | Paper
2025 | Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning | arXiv | T2I | SOC, Policy Gradient | Paper
2025 | Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control | ICLR | Unified | SOC, Adjoint | Paper, OpenReview
2024 | Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control | ICML | T2I | SOC, Reward Guidance | Paper, PMLR

Surrogate Objective / Likelihood Approximation

These methods avoid direct likelihood or rollout estimation by optimizing a tractable proxy, such as forward-process contrast, flow-matching ELBOs, variational EM, reward-weighted regression, or density-control surrogates.
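One of the proxies named above, reward-weighted regression, can be sketched in a few lines: each sample's regression loss is weighted by a softmax of its reward, so high-reward samples dominate the update without any likelihood estimate. In the sketch below, scalar predictions and targets stand in for the noise or velocity predictions of a real model. The function names and the temperature `beta` are assumptions for illustration and do not reproduce any specific paper's objective.

```python
import math


def reward_weights(rewards, beta=1.0):
    """Softmax of rewards / beta, computed with the usual max-subtraction trick."""
    top = max(rewards)
    exps = [math.exp((r - top) / beta) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_regression_loss(predictions, targets, rewards, beta=1.0):
    """Weighted squared error: samples with higher reward count for more."""
    weights = reward_weights(rewards, beta)
    return sum(
        w * (pred - tgt) ** 2
        for w, pred, tgt in zip(weights, predictions, targets)
    )
```

With two samples of reward 0 and 1 and `beta=1.0`, the weights are 1 / (1 + e) and e / (1 + e), so an error on the second sample costs roughly 2.7 times as much as the same error on the first.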

Year | Paper | Venue | Task | Method | Resources
2026 | DiffusionNFT: Online Diffusion Reinforcement with Forward Process | ICLR Oral | T2I | Surrogate, Forward Process | Paper, ICLR, Project
2026 | Diffusion Alignment as Variational Expectation-Maximization | ICLR | T2I | Surrogate, Reward Guidance | Paper, OpenReview, Code
2026 | V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think | arXiv | T2I | Surrogate, GRPO, Likelihood Estimation | Paper
2025 | Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning | NeurIPS | T2I | Surrogate, Density Control | Paper, OpenReview
2025 | Reinforcing Diffusion Models by Direct Group Preference Optimization | arXiv | T2I | Surrogate, Group Preference | Paper, Code
2025 | Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models | arXiv | T2I | Surrogate, Score Matching | Paper, Code
2025 | ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models | arXiv | T2I | Surrogate, Reward Feedback | Paper, Code
2025 | Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization | ICLR | T2I | Surrogate, Reward-Weighted Regression | Paper, Project, OpenReview
2025 | LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment | arXiv | T2V | Surrogate, Human Feedback | Paper, Project, Code
2024 | PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models | CVPR | T2I | Surrogate, Reward Model | Paper, Project

Reward-Corrected Regression / Score Matching

These methods preserve the pretraining-style regression or score-matching structure while modifying the target with reward information. Reinforce Adjoint Matching (RAM) is the clearest example of this fourth category in its own taxonomy.

Year | Paper | Venue | Task | Method | Resources
2026 | Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models | arXiv | T2I | Reward-Corrected Regression, SOC | Paper
2026 | Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models | arXiv | T2I | Reward-Corrected Regression, Score Matching | Paper

DPO / Preference Optimization

Year | Paper | Venue | Task | Method | Resources
2026 | ViPO: Visual Preference Optimization at Scale | ICLR | T2I, T2V | DPO | Paper, ICLR, OpenReview
2026 | Towards Better Optimization For Listwise Preference in Diffusion Models | ICLR | T2I, I2I | DPO | Paper, OpenReview
2026 | Diffusion Negative Preference Optimization Made Simple | ICLR | T2I | DPO | OpenReview
2026 | alpha-DPO: Robust Preference Alignment for Diffusion Models via alpha Divergence | ICLR | T2I | DPO | OpenReview, ICLR
2026 | Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning | arXiv | T2I | DPO, RLHF | Paper
2025 | Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models | arXiv | T2I | DPO | Paper, Code
2025 | CPO: Condition Preference Optimization for Controllable Image Generation | NeurIPS | T2I, I2I | DPO | Paper, OpenReview, Project
2025 | DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models | arXiv | T2V | DPO | Paper
2025 | D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples | ICML | T2I, I2I | DPO | Paper, Proceedings
2025 | CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation | ICML | T2I | DPO, Reward Guidance | OpenReview, Code
2025 | Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences | ICML | T2I | DPO | OpenReview
2025 | Rethinking DPO-style Diffusion Aligning Frameworks | ICCV Highlight | T2I | DPO | CVF, Code
2025 | Self-Supervised Direct Preference Optimization for Text-to-Image Diffusion Models | NeurIPS | T2I | DPO | NeurIPS, OpenReview
2025 | A Gradient Guidance Perspective on Stepwise Preference Optimization for Diffusion Models | NeurIPS | T2I | DPO, Reward Guidance | NeurIPS, Code
2025 | Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization | NeurIPS | T2I | DPO, Reward Model | Paper, Code
2025 | Boost Your Human Image Generation Model via Direct Preference Optimization | CVPR | T2I | DPO | Paper, CVF
2025 | Personalized Preference Fine-tuning of Diffusion Models | CVPR | T2I | DPO | Paper, CVF
2025 | Discriminator-Free Direct Preference Optimization for Video Diffusion | arXiv | T2V | DPO | Paper
2025 | DSPO: Direct Score Preference Optimization for Diffusion Model Alignment | ICLR | T2I | DPO | OpenReview, Proceedings, Code
2025 | Curriculum Direct Preference Optimization for Diffusion and Consistency Models | CVPR | T2I | DPO | Paper, Code
2025 | Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking | arXiv | T2I | DPO | Paper, OpenReview
2024 | VideoDPO: Omni-Preference Alignment for Video Diffusion Generation | arXiv | T2V | DPO | Paper, Project
2024 | OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization | arXiv | T2V | DPO | Paper, Project
2024 | Margin-aware Preference Optimization for Aligning Diffusion Models without Reference | arXiv | T2I | DPO | Paper, Project
2024 | Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization | arXiv | T2I | DPO | Paper, Code
2024 | A Dense Reward View on Aligning Text-to-Image Diffusion with Preference | ICML | T2I | DPO, Reward Model | Paper, PMLR, Code
2024 | Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model | CVPR | T2I | DPO | Paper, Code
2023 | Diffusion Model Alignment Using Direct Preference Optimization | CVPR | T2I | DPO | Paper, Code
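Many DPO-style objectives for diffusion models share a pairwise form: the model is rewarded for lowering its denoising error on the preferred sample, relative to a frozen reference model, by more than it does on the rejected sample. The sketch below is a simplified scalar version in the spirit of Diffusion-DPO, where the log-likelihood ratio is approximated by a difference of denoising errors. The function name and the single scale `beta` (which here absorbs any timestep weighting) are assumptions for illustration, and individual papers in this section modify the form.

```python
import math


def dpo_style_loss(err_w_model, err_w_ref, err_l_model, err_l_ref, beta=1.0):
    """Pairwise preference loss on denoising errors (scalar sketch).

    err_w_* are squared denoising errors on the preferred ("winner") sample,
    err_l_* on the rejected ("loser") sample, for the trained model and the
    frozen reference model respectively.
    """
    margin = (err_w_model - err_w_ref) - (err_l_model - err_l_ref)
    z = beta * margin
    # -log(sigmoid(-z)) equals softplus(z); this form avoids overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the model and the reference have identical errors, the loss equals log 2. It drops below log 2 once the model reduces its error on the preferred sample relative to the reference, and rises above it if the improvement happens on the rejected sample instead.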

Reward Guidance / Test-time Optimization

Year | Paper | Venue | Task | Method | Resources
2026 | Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning | arXiv | T2V | Reward Guidance | Paper
2026 | Scaling Group Inference for Diverse and High-Quality Generation | ICLR | Unified | Reward Guidance | Paper, Project, OpenReview
2026 | Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation | ICLR | T2I | Reward Guidance | Paper, OpenReview, Code
2025 | Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search | arXiv | T2V | Reward Guidance | Paper
2025 | Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference | arXiv | T2I | Reward Guidance, DPO | Paper
2025 | T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | ICLR | T2V | Reward Guidance | Paper, Project, Code
2024 | Video Diffusion Alignment via Reward Gradients | arXiv | T2V | Reward Guidance | Paper, Project
2024 | T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback | arXiv | T2V | Reward Guidance | Paper, Code
2023 | Aligning Text-to-Image Diffusion Models with Reward Backpropagation | arXiv | T2I | Reward Guidance | Paper, Project
2023 | Directly Fine-Tuning Diffusion Models on Differentiable Rewards | ICLR | T2I | Reward Guidance | Paper

Benchmarks & Evaluation

Year | Paper | Venue | Task | Method | Resources
2024 | VBench: Comprehensive Benchmark Suite for Video Generative Models | CVPR | T2V, I2V | Benchmark | Paper, Project, Code
2023 | GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment | NeurIPS | T2I | Benchmark | Paper, Code
2023 | Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis | arXiv | T2I | Benchmark, Reward Model | Paper, Code
2023 | Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation | NeurIPS | T2I | Benchmark, Reward Model | Paper, Code

Toolkits & Codebases

Name | Scope | Task | Method | Links
Flow-Factory | Unified framework for reinforcement learning in flow-matching models | Unified | DPO, GRPO, Toolkit | Code
Flow-GRPO | Official implementation for training flow-matching image generators with online RL | T2I | GRPO, Toolkit | Project, Code
DDPO | Official PyTorch implementation of denoising diffusion policy optimization | T2I | RLHF, PPO, Toolkit | Project, Code
ImageReward | Reward model, dataset, and reward-feedback learning code for text-to-image alignment | T2I | Reward Model, RLHF, Toolkit | Code

Related Areas

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before adding a paper or resource.
