Keywords: language feedback, no-regret learning, hypothesis testing, large language models
TL;DR: We formalize the learning from language feedback (LLF) problem, state assumptions sufficient to enable learning despite latent rewards, and develop the transfer eluder dimension as a complexity measure that characterizes the hardness of LLF problems.
Abstract: Interactively learning from observation and language feedback is an increasingly studied area, driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, a principled framing of these decision problems is still lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, state assumptions sufficient to enable learning despite latent rewards, and introduce the *transfer eluder dimension* as a measure that characterizes the hardness of LLF problems.
We formalize the intuition that the information contained in the feedback governs the learning complexity of LLF problems, and we demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward alone. We develop a no-regret algorithm, called `HeLiX`, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that `HeLiX` performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled agents that learn interactively from generic language feedback.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12812