Keywords: reinforcement learning, decision transformer, continuous control
TL;DR: We adapt GRPO to fine-tune pretrained Decision Transformers with pure policy gradients, achieving state-of-the-art performance on several benchmarks.
Abstract: Decision Transformer (DT) has emerged as a powerful paradigm for decision making by formulating offline Reinforcement Learning (RL) as a sequence modeling problem. While recent studies have begun to extend Decision Transformers to online settings, online fine-tuning with pure RL gradients remains largely underexplored: most existing approaches continue to prioritize supervised sequence modeling losses during the online phase. We identify hindsight return relabeling, a component widely used in online DTs, as a key obstacle: while beneficial for supervised objectives, it hinders importance-sampling-based RL algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). In this work, we present a new algorithm that fine-tunes Decision Transformers online purely with reinforcement learning gradients, a novel adaptation of the classical GRPO algorithm to this setting. To make GRPO efficient and compatible with DTs, we incorporate several key modifications: sub-trajectory sampling, a sequence-likelihood objective, and an active sampling strategy. Extensive experiments across diverse benchmarks show that, on average, our method significantly outperforms existing online fine-tuning approaches such as ODT and ODT+TD3, opening a new direction for the online fine-tuning of Decision Transformers.
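A minimal sketch, assuming a PyTorch setup, of what a GRPO-style clipped surrogate with a sequence-likelihood objective over sampled sub-trajectories could look like; the function name `grpo_sequence_loss`, the tensor shapes, and the clipping constant are illustrative assumptions, not the authors' implementation:

```python
import torch

def grpo_sequence_loss(logp_new, logp_old, returns, clip_eps=0.2):
    """Hypothetical GRPO-style loss over a group of G sub-trajectories.

    logp_new : (G,) summed action log-likelihoods of the G sampled
               sub-trajectories under the current DT policy.
    logp_old : (G,) the same quantities under the sampling-time policy.
    returns  : (G,) scalar return of each sub-trajectory.
    """
    # Group-relative advantage: normalize returns within the group,
    # as in standard GRPO (no learned critic).
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    # One importance ratio per sub-trajectory (sequence likelihood),
    # rather than one per action token.
    ratio = torch.exp(logp_new - logp_old)

    # PPO/GRPO-style clipped surrogate, negated for gradient descent.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Here each sub-trajectory's log-likelihood would be the sum of the DT's per-action log-probabilities over its timesteps, so the importance ratio and the clipping operate at the sequence level rather than per action.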
Primary Area: reinforcement learning
Submission Number: 15639