In-Context Multi-Armed Bandits via Supervised Pretraining

Published: 07 Nov 2023, Last Modified: 06 Dec 2023, FMDM@NeurIPS 2023
Keywords: in-context learning, transformers, foundation models, multi-armed bandits, reinforcement learning
TL;DR: We remove the critical assumption in the work of Lee et al. that pretraining data is sampled from the optimal policy.
Abstract: This work explores the in-context learning capabilities of large transformer models for decision-making in reinforcement learning (RL) environments, specifically multi-armed bandit problems. We introduce the Reward-Weighted Decision-Pretrained Transformer (DPT-RW), a model trained with straightforward supervised pretraining under a reward-weighted imitation learning loss. Given a query state and an in-context dataset, DPT-RW predicts the optimal action across varied tasks. Surprisingly, this simple approach yields a model capable of solving a wide range of RL problems in-context, exhibiting online exploration and offline conservatism without being explicitly trained for either. Notably, the model achieves optimal performance in the online setting despite being trained on data generated by suboptimal policies, without ever having access to optimal data.
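The reward-weighted imitation learning loss mentioned in the abstract can be sketched in a few lines. The sketch below is a minimal, hypothetical NumPy illustration of the general idea (scaling each action's imitation loss by the observed reward); the function name, exact weighting scheme, and toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def reward_weighted_imitation_loss(logits, actions, rewards):
    """Per-example cross-entropy over predicted action distributions,
    scaled by the observed reward (illustrative sketch, not the paper's code).

    logits:  (batch, num_arms) unnormalized action scores from the model
    actions: (batch,) indices of the actions taken in the dataset
    rewards: (batch,) rewards observed for those actions
    """
    # Numerically stable softmax over arms
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of each taken action
    nll = -np.log(probs[np.arange(len(actions)), actions] + 1e-12)
    # Reward weighting: high-reward actions are imitated more strongly,
    # zero-reward actions contribute nothing
    return float(np.mean(rewards * nll))

# Toy batch: 3 contexts, 2 arms
logits = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
actions = np.array([0, 1, 0])
rewards = np.array([1.0, 0.5, 0.0])
loss = reward_weighted_imitation_loss(logits, actions, rewards)
```

Because the weight is the reward rather than an optimal-action label, this objective needs only logged interaction data, which is consistent with dropping the optimal-policy sampling assumption.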
Submission Number: 102