Rejection Sampling Based Fine Tuning Secretly Performs PPO

Published: 10 Jun 2025 · Last Modified: 11 Jul 2025 · PUT at ICML 2025 Poster · CC BY 4.0
Keywords: fine-tuning, diffusion
TL;DR: Exact PPO for diffusion models can be performed through rejection-sampling-based fine-tuning; this can be made more efficient by fine-tuning only earlier denoising timesteps.
Abstract: Several downstream applications of pre-trained generative models require task-specific adaptation based on reward feedback. In this work, we examine strategies for fine-tuning a pre-trained model given non-differentiable rewards on its generations. We establish connections between rejection-sampling-based fine-tuning and Proximal Policy Optimization (PPO), and use this formalism to derive PPO with marginal KL constraints for diffusion models. We then propose a framework for fine-tuning only intermediate denoising steps, which makes fine-tuning of diffusion models more sample-efficient. Experimental results on layout generation and molecule generation validate these claims.
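For intuition, below is a minimal, self-contained sketch of what rejection-sampling-based fine-tuning looks like on a toy generator: sample from the current model, accept generations with probability increasing in their (non-differentiable) reward, then fine-tune by maximum likelihood on the accepted samples. The toy Gaussian model, the reward function, and the exp(r/beta) acceptance rule are illustrative assumptions for this sketch, not the paper's actual implementation or its diffusion-model setup.

```python
# Sketch (assumed setup): rejection-sampling-based fine-tuning, i.e.
# reward-filtered maximum likelihood, on a toy 1-D Gaussian "generator".
import math
import torch


class ToyGenerator(torch.nn.Module):
    """Stand-in for a pre-trained generative model: N(mu, sigma^2)."""

    def __init__(self):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.tensor(0.0))
        self.log_sigma = torch.nn.Parameter(torch.tensor(0.0))

    def sample(self, n):
        # Draw detached samples from the current model.
        with torch.no_grad():
            return self.mu + self.log_sigma.exp() * torch.randn(n)

    def log_prob(self, x):
        # Gaussian log-density, differentiable w.r.t. mu and log_sigma.
        sigma = self.log_sigma.exp()
        return (-0.5 * ((x - self.mu) / sigma) ** 2
                - self.log_sigma - 0.5 * math.log(2 * math.pi))


def reward_fn(x):
    # Non-differentiable reward (illustrative): prefer samples near 2.0.
    return -(x - 2.0).abs()


def rejection_sampling_finetune(model, steps=200, batch=256, beta=0.5, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x = model.sample(batch)
        r = reward_fn(x)
        # Accept each sample with probability exp((r - r_max) / beta),
        # so higher-reward generations are kept more often.
        accept = torch.rand(batch) < torch.exp((r - r.max()) / beta)
        kept = x[accept]
        if kept.numel() == 0:
            continue
        # Supervised fine-tuning (maximum likelihood) on accepted samples.
        loss = -model.log_prob(kept).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    model = rejection_sampling_finetune(ToyGenerator())
    print("fine-tuned mean:", model.mu.item())  # drifts toward the high-reward region
```

Under these assumptions, the acceptance rule targets the reward-tilted distribution proportional to p(x) exp(r(x)/beta), which is the kind of KL-regularized objective through which rejection-sampling-based fine-tuning connects to PPO, as the abstract indicates.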
Submission Number: 65