Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning
Keywords: Structured Bandit, Multi-task Learning, Decision Transformer
Abstract: In this paper, we study the multi-task structured bandit problem, where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure, and any optimal algorithm should exploit this shared structure to minimize cumulative regret on an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure so that it generalizes to the unseen test task. Prior work on pretrained decision transformers, such as DPT, requires access to the optimal action during training, which may be difficult to obtain in many scenarios. Diverging from these works, our learning algorithm does not need knowledge of the optimal action per task during training; instead, it predicts a reward vector for each action using only the observed offline data from the diverse training tasks. At inference time, it selects actions based on these reward predictions, employing various exploration strategies in-context for an unseen test task. We show that our model outperforms other methods such as DPT and Algorithmic Distillation (AD) and matches the performance of algorithms that require privileged information about the problem structure. Interestingly, we show that our algorithm, without knowledge of the underlying problem structure, can learn a near-optimal policy in-context by leveraging the shared structure across diverse tasks. We also show that when the shared structure breaks down due to the introduction of new actions during both training and test time, our proposed algorithm fails to learn the underlying latent structure. We further show that our algorithm conducts an implicit two-phase exploration, and we validate all of these findings through several experiments spanning linear, non-linear, bilinear, and latent bandit settings, as well as real-life datasets. Finally, we theoretically analyze the performance of our algorithm and obtain generalization bounds in the in-context multi-task learning setting.
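To illustrate the inference-time procedure described in the abstract, here is a minimal, hypothetical sketch (not the authors' code): a placeholder `pretrained_reward_model` stands in for the pretrained transformer that maps an in-context interaction history to a predicted reward per action, and `select_action` applies a simple epsilon-greedy rule as one example of an exploration strategy; all names, the toy environment, and the epsilon-greedy choice are illustrative assumptions.

```python
# Conceptual sketch only: in-context action selection from predicted per-action rewards.
import numpy as np

def pretrained_reward_model(history, num_actions):
    """Hypothetical stand-in for a transformer pretrained on offline multi-task data.
    It maps the in-context history of (action, reward) pairs to a reward vector."""
    rng = np.random.default_rng(len(history))
    return rng.normal(size=num_actions)  # placeholder for learned predictions

def select_action(history, num_actions, epsilon=0.1):
    """Pick an action from predicted rewards with epsilon-greedy exploration;
    other strategies (e.g. softmax sampling over predictions) could be swapped in."""
    predicted_rewards = pretrained_reward_model(history, num_actions)
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)   # explore
    return int(np.argmax(predicted_rewards))    # exploit predicted rewards

# In-context interaction loop on a toy unseen test task.
num_actions, horizon = 5, 20
true_means = np.random.default_rng(0).normal(size=num_actions)
history = []
for t in range(horizon):
    a = select_action(history, num_actions)
    r = true_means[a] + np.random.normal(scale=0.1)
    history.append((a, r))  # the growing history serves as the transformer's context
```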
Submission Number: 183