Abstract: Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with input prompts, has become a crucial step in building reliable generative AI models. Most works in this area use a *discrete-time* formulation, which is prone to errors induced by time discretization and is often not applicable to models with higher-order or black-box solvers.
The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using *continuous-time* RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat the score-matching functions as controls or actions, thereby connecting the problem to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL and illustrate its potential for enhancing the design space of value networks by leveraging the structural properties of diffusion models. We validate the advantages of our method through experiments on downstream tasks of fine-tuning a large-scale Text2Image model, Stable Diffusion v1.5.
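To make the formulation concrete, the following is a minimal sketch of a stochastic control objective of the kind described above; the notation (drift $f$, diffusion coefficient $g$, pre-trained score $s_{\mathrm{pre}}$, penalty weight $\beta$) and the specific quadratic regularizer are illustrative assumptions, not necessarily the paper's exact objective:

$$
\max_{u}\ \mathbb{E}\!\left[\, r(X_T, c) \;-\; \beta \int_0^T \big\| u(t, X_t, c) - s_{\mathrm{pre}}(t, X_t, c) \big\|^2 \, \mathrm{d}t \right],
\qquad
\mathrm{d}X_t = \big( f(t, X_t) + g(t)^2\, u(t, X_t, c) \big)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
$$

where $c$ is the prompt, $X_T$ is the terminal state (the generated sample), $r(\cdot, c)$ is the terminal reward from human feedback, and the control $u$ plays the role of the score in the (schematically written) denoising dynamics. The running penalty keeps the fine-tuned policy close to the pre-trained model, mirroring the KL-style regularization used in discrete-time RLHF.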
Lay Summary: We often rely on feedback from humans to teach image-generation models how to produce results that match a user’s request. Current methods break this teaching process into a fixed number of steps, which can introduce errors and does not always work with newer, more flexible solvers. In our work, we propose a new approach that treats the model’s “denoising” steps as continuous actions rather than a rigid sequence. Framing the problem this way (like guiding a car along a smooth path rather than handing it a checklist of directions) lets us use well-known techniques from continuous-time decision making to steer the model more precisely toward the user’s prompt. To test this idea, we fine-tuned a popular text-to-image model (Stable Diffusion v1.5) using our continuous-time framework. The result is a model that adapts better to feedback and generates images more faithfully aligned with what users want, while avoiding the pitfalls that discrete-step methods often encounter. This makes the process of teaching large-scale generative models more robust and broadly applicable.
Primary Area: Reinforcement Learning
Keywords: Continuous-time Reinforcement Learning, Diffusion Models Fine-tuning, Reinforcement Learning from Human Feedback
Submission Number: 5515