Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations by predicting latent target
embeddings from contextual views, but their predictors are typically shallow feed-forward networks
with limited control over multi-step dynamics and stability. We introduce Learnable Iterated
Function Systems (LIFS), a recursive and contractive latent operator that replaces the standard
JEPA predictor with a mixture of affine maps applied over multiple refinement steps. The mixture
weights are conditioned on the context embedding, enabling input-adaptive geometric structure
while preserving the original JEPA objective and encoder architecture. LIFS can be viewed as a
dynamical extension of Max-Affine Spline Operators (MASOs): instead of selecting a single affine
branch, the latent state evolves through a sequence of spectrally controlled affine transformations,
yielding trajectory-dependent partitions of the latent space and explicit contraction guarantees. We
establish sufficient conditions under which LIFS defines a Banach contraction and analyze its training
behavior through Lyapunov-style stability arguments. Empirically, integrating LIFS into JEPA leads
to smoother training dynamics and consistent improvements in predictive alignment and linear-probe
accuracy, particularly for ViT encoders, without increasing model capacity. Overall, LIFS provides
a principled and modular way to endow JEPA predictors with stable multi-step latent refinement,
bridging predictive self-supervision, MASO theory, and contraction-based dynamical systems.
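As a rough illustration of the operator described in the abstract, the following minimal sketch iterates a context-weighted mixture of affine maps whose linear parts are rescaled to spectral norm at most rho < 1. It is not the authors' implementation: the names `spectral_rescale`, `lifs_predict`, and `W_gate`, the softmax gating, and the choice to hold the mixture weights fixed across refinement steps are all assumptions made for illustration. Under these assumptions, each step is a convex combination of maps with Lipschitz constant at most rho, so the distance between two latent states shrinks by at least a factor of rho per step.

```python
import numpy as np

def spectral_rescale(A, rho):
    """Rescale A so its spectral norm (largest singular value) is at most rho."""
    sigma = np.linalg.norm(A, 2)
    return A if sigma <= rho else A * (rho / sigma)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def lifs_predict(z0, context, A, b, W_gate, steps=8):
    """Hypothetical LIFS-style predictor (illustrative only).

    z0:      (d,)      initial latent state
    context: (c,)      context embedding
    A:       (K, d, d) affine linear parts, each with spectral norm <= rho < 1
    b:       (K, d)    affine offsets
    W_gate:  (K, c)    assumed linear gating that produces mixture logits
    """
    # Mixture weights depend on the context only (an assumption of this sketch).
    w = softmax(W_gate @ context)
    z = z0
    for _ in range(steps):
        # z <- sum_k w_k * (A_k z + b_k)
        z = np.einsum("k,kij,j->i", w, A, z) + w @ b
    return z

# Small demonstration with random parameters.
rng = np.random.default_rng(0)
d, K, rho, steps = 4, 3, 0.9, 8
A = np.stack([spectral_rescale(rng.normal(size=(d, d)), rho) for _ in range(K)])
b = rng.normal(size=(K, d))
W_gate = rng.normal(size=(K, d))
context = rng.normal(size=d)

z_a, z_b = rng.normal(size=d), rng.normal(size=d)
out_a = lifs_predict(z_a, context, A, b, W_gate, steps)
out_b = lifs_predict(z_b, context, A, b, W_gate, steps)
print("initial distance:", np.linalg.norm(z_a - z_b))
print("final distance:  ", np.linalg.norm(out_a - out_b))
```

Because the weights are nonnegative and sum to one, the triangle inequality bounds the final distance by rho**steps times the initial distance. If the weights were instead recomputed from the evolving state, this simple argument would no longer apply directly.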
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: (1) Multi-scale latent modeling and fractal interpretation. We agree that our current evaluation focuses primarily on standard downstream representation quality (linear probing) rather than explicitly measuring scale-equivariant or hierarchical semantic behavior. Our use of the term “multi-scale” refers to the recursive composition of affine latent operators with different contraction strengths, as formalized in Appendix A.1-A.6. We acknowledge that this does not yet constitute a direct demonstration of scale-aware semantic decomposition in the classical vision sense. To clarify this point, we have revised the manuscript to:
• tone down the terminology around “fractal modeling”,
• explicitly describe LIFS as a recursive contractive latent operator,
• clarify that the observed multi-scale behavior refers to latent dynamical scales induced by iterative contractions.
We also agree that the contribution is a recursive contractive MASO generalization, not a fractal model, and we have changed the title accordingly.
The proposed title is:
• Accurate (matches the theory and experiments)
• Specific (clearly identifies the contribution)
• Reviewer‑friendly (addresses the reviewer's concerns about overclaiming “fractal”)
• Technically strong (MASO + stability + recursion)
• Distinctive (positions the work within JEPA research)
It is a clear improvement over the original “Fractal Predictive Operators” title, which reviewers would find misleading.
Assigned Action Editor: ~Anastasios_Kyrillidis2
Submission Number: 7371