Keywords: Large Language Model, Reasoning, Reinforcement Learning
TL;DR: Our work shows you can improve a powerful, proprietary LLM by training a smaller, open feedback model that learns how to give it specific advice to correct its own mistakes.
Abstract: The optimization of black-box large language models (LLMs) presents significant challenges. While pre-existing chain-of-thought (CoT) prompting and feedback mechanisms occasionally help, these approaches still suffer from unreliable feedback and fail to leverage training data. In this work, we propose Feedback Reinforcement Learning (FRL)---training a separate feedback model through reinforcement learning to improve the main black-box LLM. FRL divides self-correction into two stages: our trained feedback model identifies errors and generates feedback on how to correct them, while the black-box LLM generates corrections based on this feedback. During training, the feedback model generates feedback rollouts for initial responses from a fixed pretrained model, which then produces revised responses. The improvement between the initial and revised responses serves as the reward signal. This approach treats the solver model as a black box and optimizes it with a separate feedback provider, enabling targeted improvement without modifying the base model. We evaluate FRL on generated Sudoku puzzles, GSM8K, and MMLU-STEM questions, demonstrating consistent improvements over the initial language model's performance by $16.5\%$ on average. Our method outperforms both non-learning self-correction approaches and oracle-based verification methods by leveraging training data through reinforcement learning.
Moreover, FRL models can also function as problem solvers, outperforming their pretrained counterparts and effectively enhancing their original reasoning capabilities.
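To make the training signal concrete, here is a minimal sketch of the improvement-based reward described in the abstract. The names `feedback_model`, `solver`, and `score_fn` are hypothetical stand-ins, not the authors' code: `solver` plays the frozen black-box LLM, `feedback_model` is the trainable feedback provider whose rollouts are scored by this reward, and `score_fn` is a task-specific scorer (e.g., answer correctness).

```python
# Hypothetical sketch of the FRL reward computation; not the paper's implementation.
from typing import Callable, List


def frl_rewards(
    problems: List[str],
    initial_responses: List[str],
    feedback_model: Callable[[str, str], str],  # (problem, initial response) -> feedback text
    solver: Callable[[str, str, str], str],     # (problem, initial, feedback) -> revised response
    score_fn: Callable[[str, str], float],      # (problem, response) -> task score, e.g. 0/1 correctness
) -> List[float]:
    """Reward each feedback rollout by how much it improves the black-box solver."""
    rewards = []
    for problem, initial in zip(problems, initial_responses):
        # Stage 1: the trained feedback model identifies errors and suggests corrections.
        feedback = feedback_model(problem, initial)
        # Stage 2: the frozen black-box LLM revises its answer using that feedback.
        revised = solver(problem, initial, feedback)
        # Reward = improvement of the revised response over the initial one.
        rewards.append(score_fn(problem, revised) - score_fn(problem, initial))
    return rewards
```

These rewards would then feed a standard policy-gradient update of the feedback model, leaving the solver untouched.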
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18318