Keywords: Language Models, Verbal Feedback, Reinforcement Learning, Feedback-Conditional Policy
TL;DR: We introduce Feedback-Conditional Policy (FCP), a simple and scalable paradigm that directly learns from verbal feedback without scalar rewards.
Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically *compress nuanced feedback into scalar rewards*, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the **feedback-conditional policy (FCP)**. FCP learns directly from response–feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on *offline* data. We further develop an *online bootstrapping* stage in which the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to learn directly from verbal feedback.
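To make the offline stage concrete, below is a minimal sketch of FCP-style maximum-likelihood training on response–feedback pairs: the verbal feedback is prepended as a conditioning signal and the loss is computed only on response tokens. The model name, prompt template, and helper function are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of feedback-conditional MLE training (offline stage), under assumed
# model/tokenizer and prompt format. Not the authors' official code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder base model (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fcp_loss(prompt: str, feedback: str, response: str) -> torch.Tensor:
    """MLE loss for p(response | prompt, feedback): condition on the verbal
    feedback, mask conditioning tokens so only response tokens are scored."""
    condition = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse: "
    cond_ids = tok(condition, return_tensors="pt").input_ids
    resp_ids = tok(response, return_tensors="pt").input_ids
    input_ids = torch.cat([cond_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : cond_ids.shape[1]] = -100  # ignore conditioning tokens in the loss
    return model(input_ids=input_ids, labels=labels).loss

# One offline (prompt, response, feedback) triple. At inference or during the
# online bootstrapping stage, the policy is instead conditioned on a desired
# positive feedback and its fresh generations are re-annotated with feedback.
loss = fcp_loss("Summarize the article.", "Clear and faithful summary.", "The article argues ...")
loss.backward()
optim.step()
```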
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7689