Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

Published: 29 Jun 2023, Last Modified: 04 Oct 2023, MFPL Oral
Keywords: Learning from human feedback, zeroth-order optimization, Stable Diffusion, ranking and preferences
TL;DR: We propose the first zeroth-order algorithm for solving optimization problems in which only a ranking oracle of the objective function is available.
Abstract: In this study, we address an emerging optimization challenge involving a black-box objective function that can only be gauged via a ranking oracle—a situation frequently encountered in real-world scenarios, especially when the function is evaluated by human judges. A prominent instance of such a situation is Reinforcement Learning with Human Feedback (RLHF), an approach recently employed to enhance the performance of Large Language Models (LLMs) using human guidance \citep{ouyang2022training,liu2023languages,chatgpt,bai2022training}. We introduce ZO-RankSGD, a novel zeroth-order optimization algorithm designed to tackle this problem, accompanied by theoretical guarantees. Our algorithm uses a novel rank-based random estimator to determine the descent direction and provably converges to a stationary point. We demonstrate the effectiveness of ZO-RankSGD in a new application: improving the quality of images generated by a diffusion generative model using human ranking feedback. In our experiments, we find that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and offers a new and effective approach for aligning Artificial Intelligence (AI) with human intentions.
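The abstract describes a rank-based random estimator of the descent direction. As a rough illustration of that idea (this is a hypothetical sketch, not the paper's exact ZO-RankSGD update; the function names, weight scheme, and step sizes here are assumptions), one can perturb the current iterate in several random directions, ask a ranking oracle to order the perturbed points, and combine the directions with rank-derived weights:

```python
import numpy as np

def rank_zo_step(x, rank_oracle, m=8, mu=0.1, lr=0.05, rng=None):
    """One rank-based zeroth-order step (illustrative sketch, not
    the paper's exact algorithm).

    rank_oracle(points) must return, for each point, its rank
    among the queried points (0 = best, i.e., lowest objective).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample m random perturbation directions.
    u = rng.standard_normal((m, x.size))
    # Query the ranking oracle on the perturbed points.
    ranks = np.asarray(rank_oracle([x + mu * ui for ui in u]), dtype=float)
    # Convert ranks to zero-mean weights: the best-ranked direction
    # gets the largest positive weight, the worst the most negative.
    w = (m - 1) / 2.0 - ranks
    d = (w @ u) / m  # rank-based estimate of a descent direction
    return x + lr * d
```

In an RLHF-like setting, `rank_oracle` would be a human judge ordering candidate outputs; here it stands in for any comparison-only feedback signal.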
Submission Number: 21