Provably Learning from Language Feedback

Published: 01 Jul 2025 · Last Modified: 01 Jul 2025 · RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025) · CC BY 4.0
Keywords: large language models, sequential decision-making, no-regret learning, bandit
TL;DR: We formalize the learning from language feedback (LLF) problem, state the assumptions needed to enable learning despite latent rewards, and develop the transfer eluder dimension as a complexity measure that characterizes the hardness of LLF problems.
Abstract: Interactively learning from observation and language feedback is an increasingly studied area, driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations exist, a principled framing of these decision problems is still lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, state sufficient assumptions that enable learning despite latent rewards, and introduce the "transfer eluder dimension" as a complexity measure characterizing the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that the information carried by feedback changes the learning complexity of LLF. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called LLF-UCB, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Our contributions mark a first step towards designing principled agents that learn from generic language feedback.
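To make the interaction protocol concrete, below is a minimal, hypothetical sketch of a learner that receives only language feedback (no numeric reward) and selects actions optimistically over a finite hypothesis class, eliminating hypotheses inconsistent with observed feedback. This is an illustration of the general optimistic/version-space idea under simplifying assumptions (finite actions and hypotheses, exact-match feedback consistency), not the paper's LLF-UCB algorithm; all names (`Hypothesis`, `llf_optimistic_sketch`, `env_feedback`) are invented for the example.

```python
# Hypothetical sketch of learning from language feedback with optimistic selection.
# Not the paper's LLF-UCB; assumes a finite hypothesis class and exact-match
# feedback consistency for simplicity.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """A candidate model mapping an action to its latent reward and expected feedback."""
    reward: Callable[[int], float]
    feedback: Callable[[int], str]


def llf_optimistic_sketch(actions: List[int],
                          hypotheses: List[Hypothesis],
                          env_feedback: Callable[[int], str],
                          horizon: int) -> List[int]:
    """Keep hypotheses consistent with observed language feedback, then play the
    action favored by the most optimistic surviving hypothesis."""
    version_space = list(hypotheses)
    chosen = []
    for _ in range(horizon):
        # Optimism: pick the (hypothesis, action) pair with the highest modeled reward.
        _, a_star = max(
            ((h, a) for h in version_space for a in actions),
            key=lambda pair: pair[0].reward(pair[1]),
        )
        chosen.append(a_star)
        text = env_feedback(a_star)  # language feedback; the latent reward is never observed
        # Eliminate hypotheses whose predicted feedback contradicts the observation.
        survivors = [h for h in version_space if h.feedback(a_star) == text]
        version_space = survivors or version_space
    return chosen
```

In this toy setting, richer feedback strings rule out more hypotheses per interaction, which is the intuition the transfer eluder dimension is meant to quantify.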
Submission Number: 2