Abstract: Joint Embedding Predictive Architectures (JEPAs) rely on latent-space prediction to learn
representations without explicit reconstruction. While effective, their predictors are typically
implemented as shallow feed-forward networks, offering limited control over multi-step dynamics and
stability. We introduce Learnable Iterated Function Systems (LIFS), a contractive predictive operator
that replaces the standard JEPA predictor with a learned mixture of affine maps applied recursively
in latent space. Mixture weights are generated conditionally on the context embedding, allowing
the operator to adapt its local geometry across spatial locations and inputs. LIFS leaves the
training objective and encoder architecture unchanged, but explicitly constrains predictor dynamics
through spectral control and adaptive gating. Our analysis further unifies spectral control, exponential
moving average (EMA) updates, and predictive convergence under a single contraction-based perspective.
Empirically, integrating LIFS into JEPA improves training stability and yields consistent, though
moderate, gains in linear probing accuracy, particularly for ViT-based encoders and non-overlapping
prediction settings. These results highlight predictor dynamics as an important and underexplored
design axis in self-supervised learning.
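
To make the mechanism concrete, below is a minimal PyTorch sketch of what a contractive, gated mixture of affine maps applied recursively in latent space could look like. The class name `LIFSPredictor`, the hyperparameters (`n_maps`, `n_steps`, `sigma_max`), and the softmax gating head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a contractive LIFS-style predictor in PyTorch. All names
# (LIFSPredictor, n_maps, n_steps, sigma_max) and the exact parameterization
# are illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LIFSPredictor(nn.Module):
    """Learned mixture of affine maps, applied recursively in latent space."""

    def __init__(self, dim: int, n_maps: int = 4, n_steps: int = 3,
                 sigma_max: float = 0.9):
        super().__init__()
        self.n_steps = n_steps
        self.sigma_max = sigma_max  # spectral bound < 1 => each map contracts
        # One affine map (W_k, b_k) per mixture component.
        self.W = nn.Parameter(0.02 * torch.randn(n_maps, dim, dim))
        self.b = nn.Parameter(torch.zeros(n_maps, dim))
        # Gating head: mixture weights conditioned on the context embedding.
        self.gate = nn.Linear(dim, n_maps)

    def _contractive_maps(self) -> torch.Tensor:
        # Rescale each W_k so its spectral norm is at most sigma_max < 1.
        # A convex combination of contractions is again a contraction, so the
        # gated operator contracts regardless of the mixture weights.
        sigma = torch.linalg.matrix_norm(self.W, ord=2)          # (n_maps,)
        scale = self.sigma_max / sigma.clamp(min=self.sigma_max)
        return self.W * scale.view(-1, 1, 1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) context embedding. The gate is evaluated once on it,
        # so the mixture weights stay fixed across the recursive steps.
        alpha = F.softmax(self.gate(z), dim=-1)                  # (batch, n_maps)
        W = self._contractive_maps()
        for _ in range(self.n_steps):
            # z <- sum_k alpha_k * (W_k z + b_k): gated affine mixture.
            mapped = torch.einsum('kij,bj->bki', W, z) + self.b  # (batch, n_maps, dim)
            z = (alpha.unsqueeze(-1) * mapped).sum(dim=1)
        return z
```

Under these assumptions, each rescaled map has spectral norm at most `sigma_max` < 1, so every recursive step is Lipschitz with constant at most `sigma_max`; this per-step contraction is the stability property the abstract appeals to.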