Abstract: The goal of offline reinforcement learning (RL) is to extract the best possible policy from a previously collected dataset while handling the *out-of-distribution* (OOD) sample issue. Offline model-based RL (MBRL) is an appealing solution that can alleviate this issue through *state-action transition augmentation* with a learned dynamics model. However, offline MBRL methods have long been observed to fail in sparse-reward, long-horizon environments. In this work, we propose a novel MBRL method, dubbed Temporal Distance-Aware Transition Augmentation (TempDATA), which generates additional transitions in a geometrically structured representation space rather than in raw state space. To capture long-horizon behaviors efficiently, our key idea is to learn a state abstraction that reflects *temporal distance* at both the *trajectory and transition levels* of the state space. Our experiments confirm that TempDATA outperforms previous offline MBRL methods and matches or surpasses the performance of diffusion-based trajectory augmentation and goal-conditioned RL on D4RL AntMaze, FrankaKitchen, CALVIN, and pixel-based FrankaKitchen.
Lay Summary: Until now, offline model-based reinforcement learning (MBRL) has struggled with long-horizon tasks where rewards are sparse, because naively augmenting data in the raw state space often produces unrealistic transitions and fails to connect distant start and goal states within limited datasets.
Our approach, Temporal Distance-Aware Transition Augmentation (TempDATA), first learns a compact latent representation that captures true temporal distances between states, then generates new transitions in this space and decodes them back into realistic trajectories, ensuring augmented data respects the multi-step structure needed to reach far-away goals.
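For intuition, the sketch below is a minimal, hypothetical PyTorch illustration (not the authors' released code) of the two ingredients just described: an encoder trained so that latent distances track the temporal gap between states from the same trajectory, and a one-step latent dynamics model that can roll out augmented transitions in that space. The module names, the regression-style distance loss, and all hyperparameters are illustrative assumptions; the paper's actual objective captures temporal distance at both the trajectory and transition levels, and the decoder back to state space is omitted here for brevity.

```python
# Hypothetical sketch of temporal-distance-aware latent augmentation.
# Assumed components (not from the paper's code): TempDistEncoder, LatentDynamics,
# and a simple regression loss matching latent distance to step counts.
import torch
import torch.nn as nn

class TempDistEncoder(nn.Module):
    """Maps states to a latent space where distances approximate temporal gaps."""
    def __init__(self, state_dim, latent_dim=32):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, s):
        return self.phi(s)

def temporal_distance_loss(encoder, s_t, s_tk, k):
    """Regress the latent L2 distance onto the temporal gap k (in env steps)."""
    d_latent = torch.norm(encoder(s_t) - encoder(s_tk), dim=-1)
    return ((d_latent - k.float()) ** 2).mean()

class LatentDynamics(nn.Module):
    """One-step transition model acting directly in the learned latent space."""
    def __init__(self, latent_dim=32, action_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

# Toy usage: sample (s_t, s_{t+1}, s_{t+k}, k) from dataset trajectories,
# train encoder + latent model, then roll the model forward to augment data.
state_dim, action_dim, batch = 17, 8, 64
enc, dyn = TempDistEncoder(state_dim), LatentDynamics(action_dim=action_dim)
opt = torch.optim.Adam(list(enc.parameters()) + list(dyn.parameters()), lr=3e-4)

s_t  = torch.randn(batch, state_dim)       # states at time t (placeholder data)
s_t1 = torch.randn(batch, state_dim)       # next states s_{t+1}
s_tk = torch.randn(batch, state_dim)       # states s_{t+k} from the same trajectory
k    = torch.randint(1, 50, (batch,))      # temporal gaps in steps
a    = torch.randn(batch, action_dim)      # actions at time t

loss = temporal_distance_loss(enc, s_t, s_tk, k)                       # trajectory-level signal
loss = loss + ((dyn(enc(s_t), a) - enc(s_t1).detach()) ** 2).mean()    # transition-level consistency
opt.zero_grad(); loss.backward(); opt.step()
```

Once trained, rolling `LatentDynamics` forward from encoded dataset states would yield the latent transitions that are then decoded back into states for augmentation, which is the role the learned representation plays in the pipeline described above.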
By focusing on time-aware augmentation, TempDATA not only improves offline MBRL but can also be integrated into policy-only (model-free) methods, skill-based or hierarchical RL, and goal-conditioned RL, broadening its applicability across diverse reinforcement learning paradigms.
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: State Representation, Latent World Model, Data Augmentation
Submission Number: 1959