PCDT: Pessimistic Critic Decision Transformer for Offline Reinforcement Learning

Published: 10 Jul 2025 · Last Modified: 26 Jan 2026 · IEEE Transactions on Systems, Man, and Cybernetics: Systems (IEEE TSMCA) · CC BY 4.0
Abstract: The decision transformer (DT), a conditional sequence modeling approach, learns the action distribution for each state from historical information such as trajectory returns, offering a supervised learning paradigm for offline reinforcement learning. However, because DT conditions only on individual trajectories with high returns-to-go, it cannot construct better trajectories by combining action sequences drawn from different trajectories; in other words, traditional DT lacks the trajectory stitching capability. To address this limitation, we propose a novel pessimistic critic decision transformer (PCDT) for offline reinforcement learning. Our approach first pretrains a standard DT to explicitly capture behavior sequences. Next, we apply sequence importance sampling to penalize actions that deviate significantly from these behavior sequences, thereby constructing a pessimistic critic. Finally, the critic's Q-values are integrated into the policy update, so the learned policy stays close to the behavior policy while favoring actions with the highest Q-values. Theoretical analysis shows that the sequence importance sampling in PCDT establishes a pessimistic lower bound on the value function, while the value-optimality result guarantees that PCDT can learn the optimal policy. Results on the D4RL benchmark tasks and ablation studies show that PCDT inherits the strengths of both actor-critic and conditional sequence modeling methods, achieving the highest normalized scores on challenging sparse-reward and long-horizon tasks.
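To make the abstract's pipeline concrete, the following is a minimal toy sketch of the two ideas it names: a Bellman target penalized by an importance-sampling term that grows when the policy deviates from the behavior (DT) distribution, and a policy step that picks the highest-Q action within the behavior policy's support. All names, shapes, and the tabular setting are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Toy sketch of PCDT's pessimistic-critic idea (all details assumed, not the paper's code).
# Discrete MDP: a Q-table is updated toward a Bellman target that subtracts a
# pessimism penalty proportional to how far the policy's action distribution
# deviates from the behavior distribution captured by the pretrained DT.

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
gamma, alpha, lr = 0.99, 1.0, 0.5

# behavior_probs[s, a]: action distribution of the behavior sequences (assumed given,
# e.g., read off a pretrained DT); here it is random for illustration.
behavior_probs = rng.dirichlet(np.ones(n_actions), size=n_states)

Q = np.zeros((n_states, n_actions))

def pessimistic_target(s, a, r, s_next, policy_probs):
    """Bellman target with an importance-sampling pessimism penalty.

    The penalty is an expected (ratio - 1) term under the policy: it is zero
    when the policy matches the behavior distribution and grows as the policy
    puts mass on actions the behavior sequences rarely took, which lower-bounds
    the target value.
    """
    ratio = policy_probs[s_next] / (behavior_probs[s_next] + 1e-8)
    penalty = alpha * np.sum(policy_probs[s_next] * (ratio - 1.0))
    return r + gamma * (np.max(Q[s_next]) - penalty)

def greedy_within_support(s, min_beta=0.05):
    """Policy step: highest-Q action among those the behavior policy supports,
    keeping the learned policy close to the data distribution."""
    supported = behavior_probs[s] >= min_beta
    masked = np.where(supported, Q[s], -np.inf)
    return int(np.argmax(masked))

# One illustrative TD update on a synthetic transition (s=0, a=1, r=1, s'=2),
# starting the policy at the behavior distribution (penalty ~ 0).
policy_probs = behavior_probs.copy()
target = pessimistic_target(0, 1, 1.0, 2, policy_probs)
Q[0, 1] += lr * (target - Q[0, 1])
```

Note the design choice the penalty encodes: when `policy_probs == behavior_probs` the ratio is 1 and the penalty vanishes, so in-distribution actions are evaluated without pessimism; only out-of-distribution actions are pushed toward the lower bound.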