Abstract: Reinforcement learning (RL)-based fine-tuning
has emerged as a powerful approach for aligning diffusion models with black-box objectives.
Proximal policy optimization (PPO) is the most
popular method for policy optimization in this setting. While effective in terms of performance,
PPO is highly sensitive to hyper-parameters and
involves substantial computational overhead. REINFORCE, on the other hand, mitigates some
computational complexities, such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high variance and sample inefficiency. While the variance of REINFORCE can be reduced by
sampling multiple actions per input prompt and
using a baseline correction term, it still suffers
from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and
PPO, and propose leave-one-out PPO (LOOP), a
novel RL method for fine-tuning diffusion models. LOOP
combines variance reduction techniques from REINFORCE, such as sampling multiple actions per
input prompt and using a baseline correction term, with
the robustness and sample efficiency of PPO via
clipping and importance sampling. Our results
demonstrate that LOOP effectively improves diffusion models on various black-box objectives,
and achieves a better balance between computational efficiency and performance.
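The abstract describes LOOP as pairing a leave-one-out baseline computed over multiple sampled actions per prompt with PPO-style importance sampling and clipping. Below is a minimal PyTorch sketch of such an objective, included only to illustrate the idea; the function name `loop_surrogate_loss`, the tensor shapes, and the `clip_eps` default are illustrative assumptions, not the authors' implementation.

```python
import torch

def loop_surrogate_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Sketch of a LOOP-style objective: leave-one-out baseline + PPO clipping.

    logp_new, logp_old: (B, K) log-probabilities of K sampled trajectories per
        prompt under the current and behavior policies (shapes are assumptions).
    rewards: (B, K) black-box rewards for the K samples of each prompt.
    """
    B, K = rewards.shape

    # Leave-one-out baseline: for each sample, the mean reward of the
    # other K - 1 samples drawn for the same prompt.
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (K - 1)
    advantages = rewards - baseline

    # PPO-style importance ratio between current and behavior policies.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate objective; negated so it can be minimized as a loss.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In this sketch, the leave-one-out baseline supplies the variance reduction attributed to REINFORCE-style multi-sample estimates, while the clipped importance-weighted ratio supplies the PPO-style robustness and sample reuse.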