LLF-Bench: A Benchmark for Interactive Learning from Language Feedback

28 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: LLM, benchmark, decision making, reinforcement learning
TL;DR: LLF-Bench is a diverse collection of sequential decision-making tasks for assessing an agent's ability to learn to solve multi-step problems based on natural-language feedback.
Abstract: We introduce a new benchmark, LLF-Bench (Learning from Language Feedback Benchmark; pronounced "elf-bench"), to evaluate the ability of AI agents to interactively learn from natural language feedback and instructions. Learning from language feedback (LLF) is essential for people, largely because the rich information this feedback provides can help a learner avoid much trial and error and thereby speed up learning. Large Language Models (LLMs) have recently enabled AI agents to comprehend natural language, and hence AI agents can potentially benefit from language feedback during learning just as humans do. But existing interactive benchmarks do not assess this crucial capability: they either use numeric reward feedback or require no learning at all (only planning or information retrieval). LLF-Bench is designed to fill this gap. It is a diverse collection of sequential decision-making tasks that includes user recommendation, poem writing, navigation, and robot control. The objective of an agent is to interactively solve these tasks based on their natural-language instructions and the feedback received after taking actions. Crucially, so that the agent actually learns from the feedback rather than relying on prior familiarity, LLF-Bench implements several randomization techniques that ensure the task isn't familiar to the agent and that the agent is robust to various verbalizations. In addition, LLF-Bench allows configuring different types of feedback to study how agents respond to them. Together, these features make LLF-Bench a unique research platform for developing and testing LLF agents.
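To make the interaction protocol described above concrete, here is a minimal sketch of an LLF agent loop. It assumes a Gymnasium-style interface in which observations are dicts carrying natural-language fields; the package name llfbench, the env id, the dict keys, and the my_agent function are illustrative assumptions, not the benchmark's verbatim API.

    import llfbench  # assumed package name; consult the benchmark's docs for the real import

    # Create one of the benchmark tasks. The env id below is illustrative;
    # the actual registry of task names may differ.
    env = llfbench.make("llf-poem-Haiku-v0")

    obs, info = env.reset()
    print(obs["instruction"])  # the task is described in natural language

    done = False
    while not done:
        # my_agent stands in for any LLF agent, e.g., an LLM-backed policy
        # that conditions on the instruction and the latest feedback.
        action = my_agent(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # Unlike classic RL, the learning signal the agent consumes is
        # obs["feedback"]: free-form language it must interpret to improve.

Consistent with the abstract's framing, the numeric reward in this sketch would serve only for evaluation; the agent is expected to improve from the feedback text alone.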
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13615