Contractive MASO-Generalized Predictors for Stable Latent-Space Learning in JEPA

TMLR Paper7371 Authors

06 Feb 2026 (modified: 22 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations by predicting latent target embeddings from contextual views, but their predictors are typically shallow feed-forward networks with limited control over multi-step dynamics and stability. We introduce Learnable Iterated Function Systems (LIFS), a recursive and contractive latent operator that replaces the standard JEPA predictor with a mixture of affine maps applied over multiple refinement steps. The mixture weights are conditioned on the context embedding, enabling input-adaptive geometric structure while preserving the original JEPA objective and encoder architecture. LIFS can be viewed as a dynamical extension of Max-Affine Spline Operators (MASOs): instead of selecting a single affine branch, the latent state evolves through a sequence of spectrally-controlled affine transformations, yielding trajectory-dependent partitions of the latent space and explicit contraction guarantees. We establish sufficient conditions under which LIFS defines a Banach contraction and analyze its training behavior through Lyapunov-style stability arguments. Empirically, integrating LIFS into JEPA leads to smoother training dynamics and consistent improvements in predictive alignment and linear-probe accuracy, particularly for ViT encoders, without increasing model capacity. Overall, LIFS provides a principled and modular way to endow JEPA predictors with stable multi-step latent refinement, bridging predictive self-supervision, MASO theory, and contraction-based dynamical systems.
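The recursive operator described in the abstract can be illustrated with a minimal sketch, under stated assumptions. Here the latent state is refined as z <- sum_k w_k(c) (A_k z + b_k), the mixture weights w(c) come from a softmax over a linear gate applied to the context embedding c, and each A_k is rescaled so its Frobenius norm is at most rho < 1. The Frobenius rescaling is a stand-in chosen for simplicity (it upper-bounds the spectral norm); it is not necessarily the spectral control used in the paper. All names below (scale_to_contraction, lifs_step, lifs_predict, gate) are hypothetical and do not come from the manuscript.

```python
import math
import random


def scale_to_contraction(A, rho):
    """Rescale A so its Frobenius norm is at most rho.

    The Frobenius norm upper-bounds the spectral norm, so the result
    has spectral norm <= rho. (Assumption: a simple stand-in for
    whatever spectral control the paper actually uses.)
    """
    fro = math.sqrt(sum(x * x for row in A for x in row))
    if fro <= rho:
        return [row[:] for row in A]
    s = rho / fro
    return [[s * x for x in row] for row in A]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(A, z):
    return [sum(a * x for a, x in zip(row, z)) for row in A]


def lifs_step(z, weights, mats, biases):
    """One refinement step: z <- sum_k w_k * (A_k z + b_k)."""
    out = [0.0] * len(z)
    for w, A, b in zip(weights, mats, biases):
        Az = matvec(A, z)
        for i in range(len(z)):
            out[i] += w * (Az[i] + b[i])
    return out


def lifs_predict(z0, context, gate, mats, biases, steps):
    """Apply `steps` refinement steps with context-conditioned weights.

    The weights depend only on the context, so they stay fixed across
    steps and the per-step map is affine in z.
    """
    weights = softmax(matvec(gate, context))
    z = list(z0)
    for _ in range(steps):
        z = lifs_step(z, weights, mats, biases)
    return z


def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Small illustrative run with random parameters (hypothetical sizes).
rng = random.Random(0)
d, K, rho = 4, 3, 0.8
mats = [
    scale_to_contraction(
        [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)], rho
    )
    for _ in range(K)
]
biases = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
gate = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
context = [rng.gauss(0, 1) for _ in range(d)]
za = [rng.gauss(0, 1) for _ in range(d)]
zb = [rng.gauss(0, 1) for _ in range(d)]

print(dist(za, zb))
print(dist(lifs_predict(za, context, gate, mats, biases, 10),
           lifs_predict(zb, context, gate, mats, biases, 10)))
```

Because the weights form a convex combination and do not depend on z, the per-step map is affine with linear part sum_k w_k A_k, whose spectral norm is at most rho. Two latent states therefore move closer by a factor of at most rho per step, which is the Banach contraction property in this simplified setting.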
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: (1) Multi-scale latent modeling and fractal interpretation. We agree that our current evaluation focuses primarily on standard downstream representation quality (linear probing) rather than explicitly measuring scale-equivariant or hierarchical semantic behavior. Our use of the term "multi-scale" refers to the recursive composition of affine latent operators with different contraction strengths, as formalized in Appendix A.1-A.6. We acknowledge that this does not yet constitute a direct demonstration of scale-aware semantic decomposition in the classical vision sense. To clarify this point, we have revised the manuscript to:
• tone down the terminology around "fractal modeling",
• explicitly describe LIFS as a recursive contractive latent operator,
• clarify that the observed multi-scale behavior refers to latent dynamical scales induced by iterative contractions.
We also agree that the contribution is a recursive contractive MASO generalization, not a fractal model, and we have changed the title to match the content. The proposed title is:
• Accurate (matches the theory and experiments)
• Specific (clearly identifies the contribution)
• Reviewer-friendly (addresses the reviewer's concerns about overclaiming "fractal")
• Technically strong (MASO + stability + recursion)
• Distinctive (positions the work within JEPA research)
We consider it a clear improvement over the original "Fractal Predictive Operators" title, which reviewers would find misleading.
Assigned Action Editor: ~Anastasios_Kyrillidis2
Submission Number: 7371