From Weak Data to Strong Policy: Q-Targets Enable Provable In-Context Reinforcement Learning

ICLR 2026 Conference Submission 18814 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: In-Context Reinforcement Learning
Abstract: Transformers trained on offline expert-level data have shown remarkable success in In-Context Reinforcement Learning (ICRL), enabling effective decision-making in unseen environments. However, the performance of these models depends heavily on optimal or expert-level trajectories, which are expensive to collect in many real-world scenarios. In this work, we introduce Q-Target Pretrained Transformers (QTPT), a novel framework that replaces supervised learning with Q-learning during the pretraining stage. In particular, QTPT requires neither optimally labeled actions nor expert trajectories, providing a practical solution for real-world applications. We theoretically establish performance guarantees for QTPT and show its superior robustness to data quality compared with traditional supervised learning approaches. In comprehensive empirical evaluations, QTPT consistently outperforms existing approaches, especially when trained on data sampled from non-expert policies.
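The sketch below is a minimal, illustrative example of the core distinction the abstract draws: training against Q-learning (Bellman) targets rather than supervised action labels, so the target does not depend on how good the logged behavior was. It is not the authors' implementation; the tabular setup, variable names, and batch format are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical offline batch of (state, action, reward, next_state) transitions
# collected by a non-expert behavior policy. Shapes and names are illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99

states      = rng.integers(0, n_states, size=64)
actions     = rng.integers(0, n_actions, size=64)   # possibly suboptimal actions
rewards     = rng.normal(size=64)
next_states = rng.integers(0, n_states, size=64)

Q = np.zeros((n_states, n_actions))  # stand-in for a model's Q-value head

# Supervised (behavior-cloning) target: imitate whatever action the data contains,
# so target quality is capped by the data-collecting policy.
bc_targets = actions

# Q-learning target: bootstrap from the best next action, independent of how
# good the logged action was -- this is what lets training tolerate weak data.
q_targets = rewards + gamma * Q[next_states].max(axis=1)

# One tabular Q-learning update toward the bootstrapped targets.
lr = 0.1
Q[states, actions] += lr * (q_targets - Q[states, actions])
```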
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18814