Abstract: Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline Reinforcement Learning (RL) pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: not all prompts are equally informative for differentiating between tasks. This limits generalization and adaptation, especially in low-data or open-world settings where sample efficiency is crucial. To address this issue, we propose a lightweight, inference-time, bandit-based prompt-tuning framework. The bandit explores and optimizes trajectory prompt selection to enhance task performance while avoiding costly fine-tuning of the transformer backbone. Our experiments show not only clear performance gains from bandit-based prompt-tuning, but also improved sample complexity, scalability, and prompt-space exploration compared to prompt-tuning baselines. These results highlight the importance of adaptive prompt selection mechanisms for efficient generalization in offline multi-task RL.
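To make the idea concrete, below is a minimal sketch of inference-time prompt selection with a multi-armed bandit over a fixed pool of candidate trajectory prompts. The abstract does not specify the bandit algorithm; UCB1 is assumed here for illustration, and `pdt_rollout` is a hypothetical stand-in for running the frozen PDT backbone conditioned on a chosen prompt and returning the episode return.

```python
import math
import random

# Sketch only: UCB1 bandit over candidate trajectory prompts (assumption, not
# necessarily the paper's exact bandit variant). The transformer backbone is
# treated as frozen; only the prompt choice is tuned at inference time.

class UCB1PromptBandit:
    def __init__(self, num_prompts: int, exploration_c: float = 2.0):
        self.counts = [0] * num_prompts    # times each prompt has been tried
        self.values = [0.0] * num_prompts  # running mean return per prompt
        self.c = exploration_c

    def select(self) -> int:
        # Try every prompt once before exploiting.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        # UCB1 score: mean return plus an exploration bonus.
        scores = [
            v + math.sqrt(self.c * math.log(total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the mean observed return for this prompt.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def pdt_rollout(prompt) -> float:
    """Hypothetical placeholder: roll out the frozen PDT conditioned on
    `prompt` and return the resulting episode return."""
    return random.random()


if __name__ == "__main__":
    # Candidate prompts would be trajectory segments sampled from expert demos.
    candidate_prompts = [f"trajectory_prompt_{i}" for i in range(8)]
    bandit = UCB1PromptBandit(num_prompts=len(candidate_prompts))
    for _ in range(100):  # inference-time tuning budget
        arm = bandit.select()
        episode_return = pdt_rollout(candidate_prompts[arm])
        bandit.update(arm, episode_return)
    best = max(range(len(candidate_prompts)), key=bandit.values.__getitem__)
    print("Selected prompt:", candidate_prompts[best])
```

In this sketch, the reward signal is simply the episode return of a rollout, so prompt selection improves without any gradient updates to the pre-trained model.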
External IDs: doi:10.1007/978-981-95-0988-1_3