Keywords: self-supervised learning, JEPA, bisimulation, representation learning, model-based reinforcement learning, concept discovery
TL;DR: We show that in RL, transition dynamics combined with an auxiliary function determine the distinctions JEPA encoders must preserve, yielding a theory of when representation collapse cannot occur.
Abstract: The Joint-Embedding Predictive Architecture (JEPA) is increasingly widely used, but its behavior remains poorly understood. We provide a theoretical characterization of a simple, practical JEPA variant that has an auxiliary regression head trained jointly with a latent dynamics model. We prove that if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any pair of non-equivalent observations, i.e., observations that differ in their transition dynamics or auxiliary label, must map to distinct latent representations. Thus, the auxiliary task determines which distinctions the representation must preserve. Controlled ablations in a counting environment corroborate the theory and show that training the JEPA model jointly with the auxiliary head yields a richer representation than training them separately. Our work suggests a new way to improve JEPA encoders: training them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations.
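To make the described objective concrete, here is a minimal sketch of the two jointly trained losses: a latent-transition consistency loss and an auxiliary regression loss. All module names, dimensions, and the stop-gradient target choice are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the JEPA-with-auxiliary-head objective described in the abstract.
# Architecture details here (sizes, MLPs, detached targets) are assumptions for
# illustration only.
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim = 16, 4, 32  # assumed sizes

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 64), nn.ReLU(),
                         nn.Linear(64, latent_dim))  # latent transition model
aux_head = nn.Linear(latent_dim, 1)                  # auxiliary regression head

params = [*encoder.parameters(), *dynamics.parameters(), *aux_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

def joint_loss(obs, act, next_obs, aux_target):
    z, z_next = encoder(obs), encoder(next_obs)
    z_pred = dynamics(torch.cat([z, act], dim=-1))
    # Latent-transition consistency loss: the predicted next latent must match
    # the encoded next observation (target detached, a common JEPA practice).
    l_trans = ((z_pred - z_next.detach()) ** 2).mean()
    # Auxiliary regression loss: the latent must predict the auxiliary label;
    # per the theory, this rules out collapsed (constant) representations.
    l_aux = ((aux_head(z) - aux_target) ** 2).mean()
    return l_trans + l_aux

# One illustrative update on random data.
obs, act = torch.randn(8, obs_dim), torch.randn(8, act_dim)
next_obs, aux_target = torch.randn(8, obs_dim), torch.randn(8, 1)
loss = joint_loss(obs, act, next_obs, aux_target)
opt.zero_grad(); loss.backward(); opt.step()
```

Note that both losses share the encoder's gradients in a single backward pass, reflecting the joint training that the ablations compare against separate training.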
Submission Number: 113