In-Context Reinforcement Learning From Suboptimal Historical Data

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a supervised pretraining framework for in-context RL without optimal action labels
Abstract: Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we first train a transformer on an offline dataset of trajectories collected from various RL tasks, then freeze it and use it as an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavior policies. In this case, standard autoregressive training reduces to imitation learning and yields suboptimal performance. To address this, we propose the *Decision Importance Transformer* (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. Specifically, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. We then train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed from the trained value function to steer the suboptimal policies toward optimal ones. We conduct extensive experiments on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
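To make the weighted objective concrete, the following is a minimal PyTorch-style sketch of an advantage-weighted maximum likelihood loss. It assumes a discrete action space and exponential advantage weights with a temperature `beta`; these specifics, along with the function and argument names, are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of an advantage-weighted maximum likelihood loss.
# Assumptions (not from the paper): discrete actions, exponential
# advantage weights with temperature `beta`.
import torch
import torch.nn.functional as F

def weighted_mle_loss(action_logits: torch.Tensor,
                      actions: torch.Tensor,
                      advantages: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Weighted log-likelihood of the dataset actions, where actions with
    higher estimated advantage are up-weighted.

    action_logits: (batch, num_actions) outputs of the policy transformer
    actions:       (batch,) actions taken by the behavior policy
    advantages:    (batch,) advantage estimates from the value transformer
    """
    log_probs = F.log_softmax(action_logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Treat the weights as constants with respect to the policy parameters.
    weights = torch.exp(beta * advantages).detach()
    return -(weights * chosen_log_probs).mean()
```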
Lay Summary: In-context reinforcement learning (ICRL) has garnered increasing attention for its ability to enable transformer models to rapidly adapt to unseen RL tasks using only a few trajectories from those tasks. However, existing ICRL approaches typically rely on large quantities of high-quality pretraining data, limiting their practical applicability. To address this, we propose a framework that leverages only suboptimal historical data—readily available in the era of big data—for pretraining transformer models for ICRL. This significantly improves the practicality and scalability of ICRL. Central to our approach is a weighted supervised pretraining objective, where the weights are derived from an in-context advantage estimator that evaluates the quality of actions in the historical dataset. Empirically, our method yields transformer models with strong ICRL capabilities across both challenging discrete navigation and complex continuous control tasks.
Primary Area: Reinforcement Learning
Keywords: In-Context Learning; Reinforcement Learning; Transformers; In-Context Reinforcement Learning
Submission Number: 8713