Value Augmented Sampling: Predict Your Rewards To Align Language Models

Published: 05 Mar 2024, Last Modified: 08 May 2024
Venue: ICLR 2024 R2-FM Workshop Oral
License: CC BY 4.0
Keywords: alignment, reinforcement learning, preference learning, language models, LLM
TL;DR: Augmenting an LLM with a Value estimator for alignment and reward optimization
Abstract: With Large Language Models (LLMs) learning a plethora of behaviors from Internet data, it has become ever more important to adapt and align these models to cater to different human preferences, learn new skills, and unlearn harmful behavior. Currently, there exists a dichotomy of solutions, each with its own set of advantages and disadvantages. Search-based methods, such as Best-of-N and Monte Carlo Tree Search, are effective but expensive during inference and, therefore, infeasible in practice. RL-based methods, such as Proximal Policy Optimization, are computationally efficient, but are not competitive in performance. To this end, we propose a novel framework for reward optimization, Value Augmented Sampling (VAS), to effectively adapt LLMs at inference-time, while significantly reducing the computational cost compared to existing search-based methods. At the heart of our method lies the idea of augmenting the LLM's output distribution with expected reward estimates to guide the generation process toward high-reward responses at the token level. Our method outperforms established baselines, such as PPO and DPO, in improving summarization and multi-turn chat dialogue and achieves comparable results to Best-of-128 with lower inference costs. We also demonstrate the capability to align a closed-source, proprietary model (such as GPT-3.5) on learning to use a new API tool.
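To make the core idea concrete, below is a minimal sketch of value-augmented token-level decoding. It is an illustration only, not the paper's exact algorithm: the function names, the top-k restriction, the beta temperature, and the interfaces of `base_model` (a causal LM returning `.logits`) and `value_model` (a callable returning a scalar expected-reward estimate for a partial sequence) are all assumptions. The sketch shows the general pattern the abstract describes: reweighting the base LLM's next-token distribution by per-token value estimates and sampling from the combined distribution.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def value_augmented_step(base_model, value_model, input_ids, beta=1.0, top_k=20):
    """Sample one token by combining base-model log-probs with value estimates.

    Hypothetical interfaces (assumptions, not the paper's API):
      base_model(input_ids).logits -> (batch, seq_len, vocab) tensor
      value_model(input_ids)       -> scalar tensor: expected reward of the prefix
    """
    # Base LLM's next-token log-probabilities for a single sequence.
    logits = base_model(input_ids).logits[:, -1, :]        # (1, vocab)
    log_probs = F.log_softmax(logits, dim=-1)

    # Restrict to the top-k candidates to keep value evaluation cheap.
    topk_logp, topk_ids = log_probs.topk(top_k, dim=-1)    # (1, k)

    # Score each candidate continuation with the value model.
    values = []
    for tok in topk_ids[0]:
        candidate = torch.cat([input_ids, tok.view(1, 1)], dim=-1)
        values.append(value_model(candidate))               # scalar value estimate
    values = torch.stack(values).view(1, -1)                # (1, k)

    # Combine: pi(a|s) proportional to pi_base(a|s) * exp(beta * V(s, a)).
    combined = topk_logp + beta * values
    probs = F.softmax(combined, dim=-1)

    # Sample an index into the top-k set, then map back to a vocabulary id.
    next_tok = topk_ids.gather(-1, torch.multinomial(probs, 1))
    return torch.cat([input_ids, next_tok], dim=-1)
```

In this sketch, beta controls how strongly the value estimates steer generation: beta = 0 recovers the base model's sampling, while larger beta pushes decoding toward candidates with higher predicted reward.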
Submission Number: 65