Keywords: reinforcement learning, representation learning, self-supervised learning, reinforcement learning theory
TL;DR: We bridge the gap between the theory (BYOL-$\Pi$) and practice of self-predictive learning by analyzing the action-conditional self-predictive objective (BYOL-AC), yielding a new variance-like self-predictive objective (BYOL-VAR) together with unifying model-free and model-based views.
Abstract: Learning a good representation is a crucial challenge for reinforcement learning (RL) agents. Self-predictive algorithms jointly learn a latent representation and a dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model in the case of a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical implementations, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework. Interestingly, we uncover that BYOL-$\Pi$ and BYOL-AC are related through the lens of variance, which motivates a new variance-like self-predictive objective (BYOL-VAR). We unify the study of these objectives through two complementary lenses: a model-based perspective, where each objective is related to a low-rank approximation of certain dynamics, and a model-free perspective, which relates the objectives to modified value, Q-value, and advantage functions. This mismatch with the true value functions is reflected in the empirical observation (in both linear and deep RL experiments) that BYOL-$\Pi$ and BYOL-AC either perform very similarly across many tasks or differ in a task-dependent way.
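To make the distinction between the two objectives concrete, below is a minimal sketch (not the authors' code) of how a policy-level self-predictive loss (BYOL-$\Pi$ style) differs from an action-conditional one (BYOL-AC style). It assumes a discrete action space, a hypothetical linear encoder, and per-action latent-dynamics heads; the stop-gradient on the next-state latent plays the role of the bootstrapped target described in the abstract.

```python
# Minimal sketch contrasting a policy-level (BYOL-Pi style) and an
# action-conditional (BYOL-AC style) self-predictive loss.
# Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, latent_dim, num_actions = 64, 32, 4

encoder = nn.Linear(obs_dim, latent_dim)          # phi(s): observation -> latent
predictor_pi = nn.Linear(latent_dim, latent_dim)  # single head: policy-averaged latent dynamics
predictor_ac = nn.ModuleList(                     # one latent-dynamics head per action
    [nn.Linear(latent_dim, latent_dim) for _ in range(num_actions)]
)

def byol_pi_loss(obs, next_obs):
    """Predict the next latent from the current latent, marginalizing over the policy."""
    z = encoder(obs)
    z_next = encoder(next_obs).detach()           # stop-gradient: bootstrapped target
    return F.mse_loss(predictor_pi(z), z_next)

def byol_ac_loss(obs, actions, next_obs):
    """Predict the next latent conditioned on the action actually taken."""
    z = encoder(obs)
    z_next = encoder(next_obs).detach()
    preds = torch.stack(
        [predictor_ac[a](z[i]) for i, a in enumerate(actions.tolist())]
    )
    return F.mse_loss(preds, z_next)

# Toy usage on random transitions.
obs = torch.randn(8, obs_dim)
next_obs = torch.randn(8, obs_dim)
actions = torch.randint(num_actions, (8,))
print(byol_pi_loss(obs, next_obs).item(), byol_ac_loss(obs, actions, next_obs).item())
```

In practice these losses would be minimized jointly with the RL objective; the sketch only illustrates the structural difference the abstract refers to, namely whether the latent prediction is conditioned on the action.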
Submission Number: 26