Keywords: Language Models, Verbal Feedback, Reinforcement Learning, Feedback-Conditional Policy
TL;DR: We introduce Feedback-Conditional Policy (FCP), a simple and scalable paradigm that directly learns from verbal feedback without scalar rewards.
Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically *compress nuanced feedback into scalar rewards*, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the **feedback-conditional policy (FCP)**. FCP learns directly from response–feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on *offline* data. We further develop an *online bootstrapping* stage in which the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to learn directly from verbal feedback.
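To make the offline stage concrete, below is a minimal sketch of FCP-style maximum-likelihood training on response–feedback pairs: the verbal feedback is prepended as a conditioning signal and the loss is computed only on response tokens. The model name, prompt template, and helper function are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of feedback-conditional MLE training (offline stage), under assumed
# model/tokenizer and prompt format. Not the authors' official code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder base model (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fcp_loss(prompt: str, feedback: str, response: str) -> torch.Tensor:
    """MLE loss for p(response | prompt, feedback): condition on the verbal
    feedback, mask conditioning tokens so only response tokens are scored."""
    condition = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse: "
    cond_ids = tok(condition, return_tensors="pt").input_ids
    resp_ids = tok(response, return_tensors="pt").input_ids
    input_ids = torch.cat([cond_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : cond_ids.shape[1]] = -100  # ignore conditioning tokens in the loss
    return model(input_ids=input_ids, labels=labels).loss

# One offline (prompt, response, feedback) triple. At inference or during the
# online bootstrapping stage, the policy is instead conditioned on a desired
# positive feedback and its fresh generations are re-annotated with feedback.
loss = fcp_loss("Summarize the article.", "Clear and faithful summary.", "The article argues ...")
loss.backward()
optim.step()
```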
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7689