Policy-Based Trajectory Clustering in Offline Reinforcement Learning

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: reinforcement learning, offline reinforcement learning, deep learning, machine learning, clustering
TL;DR: We propose policy-based trajectory clustering methods (PG-Kmeans and CAAE) for offline RL, showing they uncover structure in multi-modal datasets.
Abstract: We introduce the task of clustering trajectories in offline reinforcement learning (RL) datasets to address the multi-modal nature of offline data. Such datasets often contain trajectories generated by diverse policies, and treating them as samples from a single distribution can obscure structure and increase distributional shift. We formalize trajectory clustering as minimizing the KL divergence between the offline trajectory distribution and a mixture of policy-induced trajectory distributions. To solve this, we propose Policy-Guided K-means (PG-Kmeans) and the Centroid-Attracted Autoencoder (CAAE). PG-Kmeans alternates between training behavior cloning policies and assigning each trajectory to the policy most likely to have generated it, while CAAE adopts a VQ-VAE-style objective that guides latent representations toward codebook entries. We prove finite-step convergence of PG-Kmeans and analyze the ambiguity of optimal solutions caused by policy-induced conflicts. Experiments on D4RL and GridWorld show that PG-Kmeans and CAAE partition trajectories into coherent clusters and offer a framework for structuring offline data, with applications in data selection, curriculum learning, and policy transfer.
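The PG-Kmeans loop described in the abstract (alternate between fitting one behavior-cloning policy per cluster and reassigning each trajectory to the policy under which it has the highest generation probability) can be sketched in a toy tabular setting. This is an illustrative reconstruction, not the paper's implementation: the "behavior cloning" step here is a smoothed action-frequency estimate per state, and all names and hyperparameters are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): discrete states/actions, two ground-truth policies.
N_STATES, N_ACTIONS, K = 5, 3, 2

def sample_traj(policy, length=20):
    """Sample a trajectory of (state, action) pairs under a tabular policy."""
    states = rng.integers(N_STATES, size=length)
    actions = np.array([rng.choice(N_ACTIONS, p=policy[s]) for s in states])
    return list(zip(states, actions))

# Two well-separated behavior policies generate the "offline dataset".
p1 = np.full((N_STATES, N_ACTIONS), 0.05); p1[:, 0] = 0.9
p2 = np.full((N_STATES, N_ACTIONS), 0.05); p2[:, 1] = 0.9
trajs = [sample_traj(p1) for _ in range(20)] + [sample_traj(p2) for _ in range(20)]

def fit_bc(assigned):
    """Tabular stand-in for behavior cloning: smoothed per-state action frequencies."""
    counts = np.ones((N_STATES, N_ACTIONS))  # Laplace smoothing avoids log(0)
    for traj in assigned:
        for s, a in traj:
            counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def pg_kmeans(trajs, k=K, iters=10):
    labels = rng.integers(k, size=len(trajs))  # random initial assignment
    for _ in range(iters):
        # M-step analogue: fit one policy per cluster on its assigned trajectories.
        policies = [fit_bc([t for t, l in zip(trajs, labels) if l == c])
                    for c in range(k)]
        # E-step analogue: log generation probability of each trajectory under each policy.
        loglik = np.array([[sum(np.log(pi[s, a]) for s, a in t) for pi in policies]
                           for t in trajs])
        new_labels = loglik.argmax(axis=1)
        if (new_labels == labels).all():
            break  # finite-step convergence: assignments have stabilized
        labels = new_labels
    return labels

labels = pg_kmeans(trajs)
```

On this toy dataset the loop recovers the two generating policies: trajectories sampled from the same policy end up sharing a cluster label, mirroring the paper's claim that assignment by generation probability partitions multi-modal data into coherent clusters.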
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22770