Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: in-context learning, in-context reinforcement learning, transformers, preference-based learning
Abstract: In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods require explicit reward signals during pretraining, limiting their applicability in real-world scenarios where rewards are ambiguous, difficult to specify, or expensive to collect. To overcome this limitation, we propose a new learning paradigm, *In-Context Preference-based Reinforcement Learning* (ICPRL), where both the pretraining of TMs and their deployment to new tasks rely solely on preference data, thereby eliminating the need for reward supervision. Within this paradigm, we study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a proven strategy in ICRL, remains effective in ICPRL for training TMs to predict optimal actions using preference-based context datasets. To improve data efficiency, we further propose alternative frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without relying on optimal action labels or reward signals. Empirical evaluations on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong generalization to unseen RL tasks, achieving performance on par with ICRL methods trained with full reward supervision.
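The abstract describes optimizing policies directly from pairwise preference data without reward signals. A common way to turn such comparisons into a training signal is the Bradley-Terry model, which converts a score difference between two segments into a preference probability and trains with a binary cross-entropy loss. The sketch below is illustrative only and is not taken from the paper; the function names and the use of scalar segment scores are assumptions (in ICPRL the scores would come from the transformer's evaluation of two candidate action sequences, per-step for I-PRL or trajectory-level for T-PRL).

```python
import numpy as np

def bradley_terry_prob(score_a, score_b):
    """Probability that segment A is preferred over segment B
    under the Bradley-Terry model: sigmoid of the score gap."""
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))

def preference_loss(scores_a, scores_b, prefs):
    """Binary cross-entropy between the model's preference
    probabilities and observed preference labels.
    prefs[i] = 1.0 if segment A was preferred in pair i, else 0.0."""
    p = bradley_terry_prob(scores_a, scores_b)
    eps = 1e-12  # guard against log(0) for saturated probabilities
    return -np.mean(prefs * np.log(p + eps)
                    + (1.0 - prefs) * np.log(1.0 - p + eps))
```

Minimizing this loss pushes the model to assign higher scores to preferred segments, so preferences alone can shape the policy, consistent with the reward-free setting the abstract describes.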
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23741