Towards Cost-Effective Reward Guided Text Generation

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present an efficient reward model for reward-guided text generation that leads to significantly faster inference and better token choices.
Abstract: Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training as in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
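To make the two key ideas in the abstract concrete, the single reward-model call per generation step that scores all candidate tokens at once, and the Bradley-Terry training objective over preferred versus dispreferred expansions, here is a rough, self-contained sketch. It is not the authors' FaRMA implementation: the toy architectures, module names, greedy scoring rule, and the mixing weight `beta` are all assumptions for illustration only.

```python
# Illustrative sketch (NOT the authors' implementation) of reward-guided decoding
# where one reward-model forward pass per step scores every candidate next token.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 1000, 64  # toy sizes, chosen arbitrarily for the sketch

class TinyLM(nn.Module):
    """Stand-in base language model: maps a token sequence to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)
    def forward(self, ids):                 # ids: (batch, seq_len)
        h = self.embed(ids).mean(dim=1)     # crude pooled context representation
        return self.head(h)                 # (batch, VOCAB) next-token logits

class TokenRewardModel(nn.Module):
    """Reward model with a vocabulary-sized output head: a single forward pass
    produces a reward score for every possible next token simultaneously."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)
    def forward(self, ids):
        h = self.embed(ids).mean(dim=1)
        return self.head(h)                 # (batch, VOCAB) per-candidate rewards

def bradley_terry_loss(rewards, chosen, rejected):
    """Pairwise Bradley-Terry loss on per-token reward scores: pushes the score
    of the preferred next token above the score of the dispreferred one."""
    diff = rewards.gather(1, chosen) - rewards.gather(1, rejected)
    return -F.logsigmoid(diff).mean()

@torch.no_grad()
def guided_decode(lm, rm, prompt_ids, max_new_tokens=20, beta=1.0):
    """Greedy reward-guided decoding: one LM call and one RM call per step."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        lm_logits = lm(ids)                                  # (1, VOCAB)
        rewards = rm(ids)                                    # (1, VOCAB), single RM call
        scores = lm_logits.log_softmax(-1) + beta * rewards  # assumed combination rule
        next_id = scores.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

prompt = torch.randint(0, VOCAB, (1, 5))
print(guided_decode(TinyLM(), TokenRewardModel(), prompt).shape)

chosen, rejected = torch.randint(0, VOCAB, (1, 1)), torch.randint(0, VOCAB, (1, 1))
print(bradley_terry_loss(TokenRewardModel()(prompt), chosen, rejected))
```

The contrast with sequence-level RGTG baselines is that they would call the reward model once per candidate continuation at each step, whereas the vocabulary-sized head above amortizes that into a single call.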
Lay Summary: Can language models improve with the help of human feedback without re-training? Re-training is expensive: it requires computational resources and electricity, and contributes to carbon emissions. Prior work has shown that it is indeed possible, but at the cost of longer response times when the language model answers a query. We present a method, FaRMA, that significantly reduces this response time while still avoiding re-training. Moreover, we identify scenarios where prior methods fail to provide good responses and show that FaRMA is not vulnerable to these failures.
Link To Code: https://github.com/ahmadrash/FaRMA
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, RLHF, Alignment, Model Efficiency, Reward Models, Sampling
Submission Number: 8143