LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: In this paper, we present a benchmark that pressure-tests the multimodal decision-making capabilities of today's frontier models in the very long-context regime (up to one million tokens) and investigates whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing numbers of expert demonstrations in the context, from no demonstrations up to 512 full episodes. Across our tasks, models rarely reach full expert performance, and presenting more demonstrations often has little effect, although some models steadily improve with more demonstrations on a few tasks. We also investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open-source our benchmark, which covers the zero-, few-, and many-shot regimes in a unified evaluation.
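To make the evaluation protocol concrete, the sketch below shows the basic loop in Python: a number of expert demonstration episodes are serialized into the model's context, and the model is then queried as a policy, one action per environment step. This is a minimal illustration under assumed names; the GridWorld environment, expert_action, and query_model stub are hypothetical placeholders, not the lm_act API (see the linked repository for the actual implementation).

import random

ACTIONS = ["up", "down", "left", "right"]

class GridWorld:
    """Toy 5x5 grid world: the agent starts at (0, 0) and must reach (4, 4)."""

    def __init__(self):
        self.pos = (0, 0)

    def observe(self) -> str:
        return f"agent at {self.pos}, goal at (4, 4)"

    def step(self, action: str) -> bool:
        """Apply a move and report whether the goal has been reached."""
        x, y = self.pos
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        self.pos = (min(max(x + dx, 0), 4), min(max(y + dy, 0), 4))
        return self.pos == (4, 4)

def expert_action(pos) -> str:
    """Scripted expert: walk right until x = 4, then walk up."""
    x, _ = pos
    return "right" if x < 4 else "up"

def record_demonstration(max_steps: int = 16) -> str:
    """Serialize one expert episode as alternating observation/action lines."""
    env, lines = GridWorld(), []
    for _ in range(max_steps):
        obs, act = env.observe(), expert_action(env.pos)
        lines.append(f"observation: {obs}\naction: {act}")
        if env.step(act):
            break
    return "\n".join(lines)

def query_model(prompt: str) -> str:
    # Placeholder for a frontier-model API call: a real evaluation would send
    # the prompt to the model and parse the returned action. Here we act randomly.
    return random.choice(ACTIONS)

def evaluate(num_demos: int, max_steps: int = 32) -> bool:
    """Roll out the model as a policy with num_demos expert episodes in context."""
    context = "\n\n".join(
        f"[demonstration {i + 1}]\n{record_demonstration()}" for i in range(num_demos)
    )
    env = GridWorld()
    for _ in range(max_steps):
        prompt = f"{context}\n\n[episode]\nobservation: {env.observe()}\naction:"
        if env.step(query_model(prompt)):
            return True
    return False

if __name__ == "__main__":
    for n in (0, 2, 8):  # zero-, few-, and many-shot regimes
        print(f"{n:>2} demonstrations -> reached goal: {evaluate(n)}")

Sweeping num_demos from 0 up to 512 full episodes reproduces the zero- to many-shot axis studied in the paper; encoding observations as images instead of text follows the same loop with a multimodal prompt.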
Lay Summary: We evaluate how well large language models (LLMs), such as Gemini, ChatGPT, and Claude, perform in simulated interactive environments (e.g., moving a player to a target in a 2D world) and games (e.g., tic-tac-toe, chess, Atari). Performing well on these tasks requires cognitive skills that many standard LLM benchmarks do not assess and that are relevant to, for example, real-life robotics applications. We test LLMs in two settings: first, when their only source of information is the data produced by interacting with the environment; and second, when they are additionally given many examples from an expert that they could, in principle, imitate, to see whether and how this changes their behavior. Our results show that most models do not achieve expert-level performance and that providing additional examples often does not significantly improve performance. Accordingly, our results indicate that LLMs are not yet fully ready for use in these domains. We share our testbed publicly, allowing other researchers to build upon our work and to measure whether and to what extent new methods enhance the capabilities of LLMs.
Link To Code: https://github.com/google-deepmind/lm_act
Primary Area: General Machine Learning->Evaluation
Keywords: in-context learning, imitation learning, multimodal, long context, reasoning, interactive decision-making
Submission Number: 2624