Rejection Sampling Based Fine Tuning Secretly Performs PPO

Published: 10 Jun 2025 · Last Modified: 11 Jul 2025 · PUT at ICML 2025 Poster · CC BY 4.0
Keywords: fine-tuning, diffusion
TL;DR: Exact PPO for diffusion models can be performed through rejection-sampling-based fine-tuning; this can be made more efficient by fine-tuning only earlier denoising timesteps.
Abstract: Several downstream applications of pre-trained generative models require task-specific adaptation based on reward feedback. In this work, we examine strategies for fine-tuning a pre-trained model given non-differentiable rewards on its generations. We establish connections between rejection-sampling-based fine-tuning and Proximal Policy Optimization (PPO), and use this formalism to derive PPO with marginal KL constraints for diffusion models. We then propose a framework for fine-tuning only intermediate denoising steps, which makes fine-tuning of diffusion models more sample-efficient. Experimental results on layout generation and molecule generation validate these claims.
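For intuition, below is a minimal, self-contained sketch of what rejection-sampling-based fine-tuning looks like on a toy generator: sample from the current model, accept generations with probability increasing in their (non-differentiable) reward, then fine-tune by maximum likelihood on the accepted samples. The toy Gaussian model, the reward function, and the exp(r/beta) acceptance rule are illustrative assumptions for this sketch, not the paper's actual implementation or its diffusion-model setup.

```python
# Sketch (assumed setup): rejection-sampling-based fine-tuning, i.e.
# reward-filtered maximum likelihood, on a toy 1-D Gaussian "generator".
import math
import torch


class ToyGenerator(torch.nn.Module):
    """Stand-in for a pre-trained generative model: N(mu, sigma^2)."""

    def __init__(self):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.tensor(0.0))
        self.log_sigma = torch.nn.Parameter(torch.tensor(0.0))

    def sample(self, n):
        # Draw detached samples from the current model.
        with torch.no_grad():
            return self.mu + self.log_sigma.exp() * torch.randn(n)

    def log_prob(self, x):
        # Gaussian log-density, differentiable w.r.t. mu and log_sigma.
        sigma = self.log_sigma.exp()
        return (-0.5 * ((x - self.mu) / sigma) ** 2
                - self.log_sigma - 0.5 * math.log(2 * math.pi))


def reward_fn(x):
    # Non-differentiable reward (illustrative): prefer samples near 2.0.
    return -(x - 2.0).abs()


def rejection_sampling_finetune(model, steps=200, batch=256, beta=0.5, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x = model.sample(batch)
        r = reward_fn(x)
        # Accept each sample with probability exp((r - r_max) / beta),
        # so higher-reward generations are kept more often.
        accept = torch.rand(batch) < torch.exp((r - r.max()) / beta)
        kept = x[accept]
        if kept.numel() == 0:
            continue
        # Supervised fine-tuning (maximum likelihood) on accepted samples.
        loss = -model.log_prob(kept).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    model = rejection_sampling_finetune(ToyGenerator())
    print("fine-tuned mean:", model.mu.item())  # drifts toward the high-reward region
```

Under these assumptions, the acceptance rule targets the reward-tilted distribution proportional to p(x) exp(r(x)/beta), which is the kind of KL-regularized objective through which rejection-sampling-based fine-tuning connects to PPO, as the abstract indicates.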
Submission Number: 65