Tutorial on Joint Embedding Predictive Architectures (JEPA): Foundations, Applications, and Future Directions

Published: 30 Nov 2025, Last Modified: 07 Jun 2026. Venue: OpenReview Archive Direct Upload. Readers: Everyone. License: CC BY 4.0.
Abstract: Joint-Embedding Predictive Architectures (JEPAs) have recently emerged as a unifying paradigm in self-supervised representation learning, combining the semantic alignment of joint-embedding methods with predictive modeling in latent space. This tutorial provides a systematic exposition of JEPA and its extensions, covering theoretical foundations, architectural design principles, and application domains. We first situate JEPA within the broader taxonomy of representation learning and formulate its core components: context-target generation, encoding, latent-space prediction, regularization, and energy minimization. We then describe JEPA applications, ranging from downstream tasks facilitated by JEPA to planning and decision-making via predictive world models. In particular, the tutorial presents a framework and pipeline for realizing LeCun's vision of agentic AI, in which a multi-level JEPA predictor functions as a latent-space world model integrated with actor training for Mode-2 planning and control. The tutorial also surveys emerging domain-specific applications of JEPA in 6G networks, where only a few pioneering studies exist to date. A survey of existing JEPA implementations across modalities, including image, audio, video, point-cloud, and multimodal applications, highlights how JEPA principles have been adapted to different data structures and learning tasks. Finally, open challenges and research directions for advancing JEPA in various domains are discussed.