In-Context Reinforcement Learning Without Optimal Action Labels

Published: 18 Jun 2024, Last Modified: 20 Jul 2024 · ICML 2024 Workshop ICL Poster · CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Reinforcement Learning, In-Context Learning, Large Language Models
Abstract: Large language models (LLMs) have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (RL). In this setting, we first train a transformer on an offline dataset of trajectories collected from various RL instances, then freeze it and use it to derive an action policy for new RL instances. We consider the setting where the offline dataset contains trajectories sampled from suboptimal behavior policies. In this case, standard autoregressive training amounts to imitation learning and yields suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT), which emulates the actor-critic algorithm in an in-context manner. DIT trains a transformer-based policy with a weighted maximum likelihood estimation (WMLE) loss, where the weights are derived from the observed rewards and act as importance sampling ratios, guiding the suboptimal policy toward the optimal one. We conduct extensive experiments on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the pretraining dataset contains suboptimal action labels.
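To make the reward-weighted MLE idea in the abstract concrete, here is a minimal sketch of a weighted maximum-likelihood training step on offline data. It is an illustrative assumption, not the authors' DIT implementation: the `PolicyNet`, `wmle_loss`, the toy batch, and the softmax-based reward weighting are all hypothetical stand-ins (DIT uses a transformer policy and its own weighting scheme derived from observed rewards).

```python
# Sketch of reward-weighted MLE (WMLE) on an offline batch.
# All names and the weighting choice below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Toy stand-in for the transformer policy: maps a state/context vector to action logits."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)  # (batch, num_actions) logits

def wmle_loss(logits: torch.Tensor, actions: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Weighted MLE: per-sample negative log-likelihood of the logged actions,
    scaled by reward-derived weights that play the role of importance-sampling ratios."""
    nll = F.cross_entropy(logits, actions, reduction="none")  # shape (batch,)
    return (weights * nll).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    policy = PolicyNet(state_dim=8, num_actions=4)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Fake offline batch logged by a suboptimal behavior policy.
    states = torch.randn(32, 8)
    actions = torch.randint(0, 4, (32,))
    rewards = torch.rand(32)

    # One possible weighting choice (assumed here): exponentiated, normalized rewards,
    # so higher-reward actions contribute more to the likelihood.
    weights = torch.softmax(rewards / 0.5, dim=0) * rewards.numel()

    loss = wmle_loss(policy(states), actions, weights.detach())
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"WMLE loss: {loss.item():.4f}")
```

With uniform weights this reduces to plain imitation learning on the logged actions; the reward-dependent weights are what push the learned policy away from the suboptimal behavior policy.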
Submission Number: 22