Successor Clusters: A Behavior Basis for Unsupervised Zero-Shot Reinforcement Learning

TMLR Paper 3834 Authors

03 Dec 2024 (modified: 20 Feb 2025), under review for TMLR, licensed CC BY 4.0
Abstract: In this work, we introduce Successor Clusters (SCs), a novel method for tackling unsupervised zero-shot reinforcement learning (RL) problems. The goal in this setting is to directly identify policies capable of optimizing any given reward function without requiring further learning after an initial reward-free training phase. Existing state-of-the-art techniques leverage Successor Features (SFs)---functions that characterize a policy's expected discounted sum of a set of $d$ reward features. Importantly, however, the performance of these techniques depends critically on how well the reward features allow arbitrary reward functions of interest to be linearly approximated. We introduce a novel and principled approach for constructing reward features and prove that they allow any Lipschitz reward function to be approximated arbitrarily well. Furthermore, we mathematically derive upper bounds on the corresponding approximation errors. Our method constructs features by clustering the state space via a novel distance metric quantifying the minimal expected number of timesteps needed to transition between any pair of states. Building on these theoretical contributions, we introduce Successor Clusters, a variant of the successor features framework capable of predicting the time a policy spends in different regions of the state space. We demonstrate that, after a pre-training phase, our method can approximate and maximize any new reward function in a zero-shot manner. Importantly, we also formally show that, as the number and quality of clusters increase, the set of policies induced by Successor Clusters converges to a set containing the optimal policy for any new task. Moreover, we show that our technique naturally produces interpretable features, enabling applications such as visualizing the sequence of state regions an agent is likely to visit while solving a task. Finally, we empirically demonstrate that our method outperforms state-of-the-art competitors on challenging continuous control benchmarks, achieving superior zero-shot performance and lower reward approximation error.
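For context, the successor features formulation the abstract refers to can be summarized as follows; this is a background sketch in conventional notation, and the symbols $\phi$, $\psi^{\pi}$, $w$, and $\gamma$ are standard choices rather than notation taken from the paper. Given reward features $\phi: \mathcal{S} \to \mathbb{R}^d$, the successor features of a policy $\pi$ are
$$\psi^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_{t}) \,\middle|\, s_{0}=s,\; a_{0}=a\right],$$
so any reward expressible as $r(s) \approx \phi(s)^{\top} w$ admits the zero-shot value estimate $Q^{\pi}_{r}(s, a) \approx \psi^{\pi}(s, a)^{\top} w$, which can then be maximized over a set of pre-trained policies without further learning. The clustering distance described in the abstract can plausibly be formalized as a minimal expected hitting time, e.g. $d(s, s') = \min_{\pi} \mathbb{E}_{\pi}\!\left[\,\text{timesteps to first reach } s' \text{ from } s\,\right]$, a reading inferred from the abstract rather than the paper's exact definition.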
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Oleg_Arenz1
Submission Number: 3834