Keywords: reinforcement learning, offline reinforcement learning, zero-shot reinforcement learning, contrastive learning
Abstract: Successor measures capture long-horizon, forward-in-time state occupancy statistics for a given policy. Prior work in RL and neuroscience has identified the successor measure as a sufficient statistic for estimating value functions under arbitrary rewards, making it an important mechanism for offline-to-online adaptation. However, modern RL methods such as Contrastive RL (CRL), which estimate and use successor features, often violate the on-policy assumption implicit in this definition. In this work, we identify a failure mode in offline-to-online adaptation that arises from training successor features on mixed-policy buffers. We present didactic tabular results, along with results in continuous, high-dimensional settings that reflect the same failure mode, partially explaining past empirical observations that vanilla CRL cannot scale in the offline setting. These results make progress toward bridging the gap between scalable CRL methods and offline-to-online adaptation methods based on the successor measure.
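To make the "sufficient statistic" claim concrete, here is a minimal tabular sketch, not taken from the paper. It assumes a small, hypothetical 3-state Markov chain induced by one fixed policy and uses the standard closed form M = (I - gamma * P_pi)^-1 for the discounted successor measure. The transition matrix, discount, and reward vector are illustrative assumptions.

```python
import numpy as np

def successor_measure(P_pi, gamma):
    """Discounted state occupancy M = sum_t gamma^t P_pi^t = (I - gamma P_pi)^-1."""
    n = P_pi.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P_pi)

# Hypothetical 3-state chain under a single fixed policy (each row sums to 1).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.9, 0.1],
                 [0.1, 0.0, 0.9]])
gamma = 0.95  # assumed discount factor

M = successor_measure(P_pi, gamma)

# For any reward vector, the value function is one matrix-vector product.
r = np.array([0.0, 0.0, 1.0])  # illustrative reward: 1 in the last state
V = M @ r
```

Because M is defined through the transition matrix of one policy, the identity V = M r holds only for that policy. A measure estimated from data pooled across several behavior policies need not correspond to any single one of them, which is the on-policy assumption the abstract refers to.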
Submission Number: 136