Rebalancing Return Coverage for Conditional Sequence Modeling in Offline Reinforcement Learning

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Offline Reinforcement Learning, Conditional Sequence Modeling, Return Coverage Rebalancing
TL;DR: We reveal how return-coverage affects the performance of conditional sequence modeling policies in offline RL and propose an algorithm achieving new state-of-the-art results on D4RL.
Abstract: Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of conditional sequence modeling (CSM), a paradigm that models the action distribution conditioned on both historical trajectories and target returns associated with each state. However, due to the imbalanced return distribution of suboptimal datasets, CSM suffers from a serious distributional-shift problem when conditioning on high returns. While recent approaches attempt to tackle this challenge empirically through return-rebalancing techniques such as weighted sampling and value-regularized supervision, the relationship between return rebalancing and the performance of CSM methods is not well understood. In this paper, we reveal that both expert-level and full-spectrum return coverage critically influence the performance and sample efficiency of CSM policies. Building on this finding, we devise a simple yet effective return-coverage rebalancing mechanism that can be seamlessly integrated into common CSM frameworks, including the most widely used one, Decision Transformer (DT). The resulting CSM algorithm, referred to as Return-rebalanced Value-regularized Decision Transformer (RVDT), integrates both implicit and explicit return-coverage rebalancing mechanisms and achieves state-of-the-art performance on the D4RL benchmark.
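The abstract mentions return rebalancing via weighted sampling but gives no implementation details. As a rough, hypothetical illustration of that general idea (not RVDT's actual mechanism), the sketch below upweights trajectories from sparsely covered return bins so that a CSM policy trained on a suboptimal dataset sees a more balanced return spectrum; the function name, `alpha`, and `n_bins` are assumptions introduced here for illustration only.

```python
import numpy as np

def return_rebalanced_weights(trajectory_returns, alpha=1.0, n_bins=20):
    """Illustrative return-coverage rebalancing via inverse-frequency weighting.

    Trajectories whose returns fall in sparsely covered bins (typically the
    high-return tail of a suboptimal dataset) receive larger sampling weights.
    alpha=0 recovers uniform sampling; larger alpha rebalances more aggressively.
    """
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    # Estimate return coverage with a histogram over the dataset's returns.
    counts, edges = np.histogram(returns, bins=n_bins)
    bin_ids = np.clip(np.digitize(returns, edges[1:-1]), 0, n_bins - 1)
    # Inverse-frequency weights, tempered by alpha, normalized to probabilities.
    weights = (1.0 / np.maximum(counts[bin_ids], 1)) ** alpha
    return weights / weights.sum()

# Example: a suboptimal dataset with many low-return and few high-return trajectories.
rng = np.random.default_rng(0)
traj_returns = np.concatenate([rng.normal(20, 5, 900), rng.normal(90, 5, 100)])
probs = return_rebalanced_weights(traj_returns, alpha=1.0)
batch_idx = rng.choice(len(traj_returns), size=256, p=probs)  # rebalanced minibatch
print(f"mean return of sampled batch: {traj_returns[batch_idx].mean():.1f}")
```

In a DT-style training loop, such probabilities would replace uniform trajectory sampling, improving coverage of expert-level returns during conditioning; how RVDT combines this kind of explicit rebalancing with implicit, value-regularized rebalancing is detailed in the paper itself.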
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 7519