Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations by predicting latent targets rather than reconstructing high-dimensional, pixel-level observations. Variational JEPA (VJEPA) extends this idea by replacing deterministic target regression with a probabilistic predictive model and a target-side KL regularizer. This paper provides a theoretical analysis of VJEPA through the lenses of Information Bottleneck (IB) and Predictive Information Bottleneck (PIB). We analyze two forms of VJEPA: the context-target form, naturally associated with IB, and the temporal form, naturally associated with PIB. We show that the current VJEPA objective is a partial bottleneck objective: its latent negative log-likelihood implements a Barber-Agakov lower bound on predictive mutual information, while its target KL regularizes the target-side latent distribution, rather than directly compressing the context or current state. We then derive two completed objectives: a full IB-VJEPA, which introduces a stochastic context encoder and a KL-to-prior penalty that upper-bounds the context information $I(X_C;Z_C)$, and a full PIB-VJEPA as its temporal specialization, which introduces a stochastic current-state encoder and a KL-to-prior penalty that upper-bounds the state information $I(X_{\le t};Z_t)$. The resulting analysis separates three information-theoretic roles that are conflated in standard JEPA-style objectives: predictive information maximization, target-side regularization, and explicit context/state compression. This closes the objective-level gap between probabilistic latent prediction and explicit information bottleneck control, providing a principled route to compression-controlled, uncertainty-aware VJEPA models.
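For orientation, the two standard variational bounds that the abstract refers to can be sketched as follows. This is a generic statement of the Barber-Agakov lower bound and the KL-to-prior upper bound, not the paper's exact objective. The symbols $Z_T$ (target latent), $q$ (variational predictor), $p$ (stochastic context encoder), and $r$ (fixed prior) are notation assumed here for illustration and may differ from the paper's.

```latex
% Barber-Agakov lower bound on predictive mutual information:
% holds for any variational predictor q(z_T | z_C), so minimizing the
% latent negative log-likelihood -E[log q] tightens the bound.
I(Z_C; Z_T) \;\ge\; H(Z_T) + \mathbb{E}_{p(z_C, z_T)}\!\left[\log q(z_T \mid z_C)\right]

% KL-to-prior upper bound on context information:
% holds for any prior r(z_C); the gap equals KL(p(z_C) || r(z_C)) >= 0.
I(X_C; Z_C) \;\le\; \mathbb{E}_{p(x_C)}\!\left[\mathrm{KL}\!\left(p(z_C \mid x_C)\,\|\,r(z_C)\right)\right]
```

Under these assumptions, the temporal (PIB) case would follow by substituting $X_{\le t}$ for $X_C$ and $Z_t$ for $Z_C$ in the second bound.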
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jakub_Mikolaj_Tomczak1
Submission Number: 8929