TL;DR: A benchmark for enabling the development of RL algorithms for language tasks
Abstract: Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might require considerable prompt tuning. Even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional, temporally extended interactions (e.g., with humans), the emergence of complex skills such as persuasion, and long-horizon strategic behavior, as in the context of games. Enabling this requires the community to develop reliable reinforcement learning algorithms for training LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations of multi-turn interactions, and cover a range of task properties and challenges relevant to improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework for getting started on multi-turn RL with offline value-based and online policy-based RL methods. Our benchmark consists of 3 Interactive Dialogue tasks and 5 RL Capability tests, for a total of 8 tasks that require multiple rounds of language interaction and span open-ended dialogue and text games.
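To make the multi-turn setting concrete, below is a minimal, self-contained Python sketch of the generic interaction loop that such a benchmark evaluates: a policy emits an utterance, the environment replies and assigns a per-turn reward, and the episode return is accumulated over turns. This is illustrative only and does not use the actual LMRL-Gym API; the toy environment, its reset/step methods, and the random_policy stand-in are hypothetical names, and a real agent would be an LLM policy trained with offline value-based or online policy-based RL.

```python
# Illustrative sketch only: NOT the LMRL-Gym API. It shows the shape of a
# multi-turn text-interaction loop: the agent speaks, the environment replies
# and scores the turn, and the episode return is the sum of per-turn rewards.

import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyGuessingEnv:
    """A tiny 'twenty questions'-style text environment (hypothetical)."""
    secret: str = field(default_factory=lambda: random.choice(["cat", "dog", "fish"]))
    max_turns: int = 5
    history: List[str] = field(default_factory=list)

    def reset(self) -> str:
        self.history = ["env: I am thinking of an animal."]
        return self.history[-1]

    def step(self, utterance: str) -> Tuple[str, float, bool]:
        """Append the agent's utterance, reply, and return (observation, reward, done)."""
        self.history.append(f"agent: {utterance}")
        if self.secret in utterance.lower():
            reply, reward, done = "env: Correct!", 1.0, True
        else:
            turns_used = len(self.history) // 2  # one agent line per turn so far
            reply, reward, done = "env: No, keep guessing.", -0.1, turns_used >= self.max_turns
        self.history.append(reply)
        return reply, reward, done


def random_policy(observation: str, history: List[str]) -> str:
    """Stand-in for an LLM policy; a real agent would condition on the full dialogue."""
    return f"Is it a {random.choice(['cat', 'dog', 'fish', 'bird'])}?"


if __name__ == "__main__":
    env = ToyGuessingEnv()
    obs, done, episode_return = env.reset(), False, 0.0
    while not done:
        action = random_policy(obs, env.history)
        obs, reward, done = env.step(action)
        episode_return += reward
    print("\n".join(env.history))
    print(f"return = {episode_return:.1f}")
```

In this sketch the reward only arrives when the goal is reached (with a small per-turn penalty), which is the kind of delayed, multi-turn credit-assignment problem the benchmark's tasks are designed to exercise.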
Lay Summary: Large language models like ChatGPT are great at writing fluent text, but they often struggle when it comes to making decisions over multiple steps, such as holding a strategic conversation, asking good follow-up questions, or planning ahead in a game. This is because these models are not trained to act like agents with goals that unfold over time. Our research tackles this by combining language models with reinforcement learning (RL), a method in which agents learn by trial and error and are rewarded for making good decisions. But to build truly smart, goal-directed language agents, we first need good scenarios to test our agents and reliable ways to measure progress.
This is why we built LMRL-Gym, a new benchmark suite that helps researchers train and evaluate language models in tasks that require multiple turns of interaction. These include strategic dialogues, open-ended games, and other challenges where a model has to think ahead and improve over time. By sharing these tasks and training tools openly, we aim to accelerate progress toward language agents that are not just good at generating text, but can act with purpose and improve through experience.
Link To Code: https://github.com/abdulhaim/LMRL-Gym
Primary Area: Reinforcement Learning
Keywords: a benchmark for enabling development of RL algorithms for language tasks
Submission Number: 14551