An Empirical Study on Unifying JEPA and Language Supervision for Visual Representation Learning

Published: 23 Sept 2025, Last Modified: 17 Nov 2025 · UniReps 2025 · CC BY-NC-SA 4.0
Track: Proceedings Track
Keywords: Multimodal Representation Learning, Contrastive Learning, Self-supervised Learning
TL;DR: We conduct an empirical study on visual representation learning through jointly optimizing I-JEPA and CLIP objectives.
Abstract: Unified visual representations from language supervision and self-supervision offer the potential to advance general-purpose vision models. In this work, we present an empirical study on unifying the image-based joint-embedding predictive architecture (I-JEPA) with language supervision from CLIP for visual representation learning. I-JEPA is unique among self-supervised learning methods in that it is predictive rather than contrastive or generative, enabling faster convergence with less compute while still producing strong representations. Existing works have shown that joint training with language supervision and other visual self-supervision methods yields improved model performance, but combining language supervision with I-JEPA remains unexplored. We introduce CLIPred, a framework that jointly optimizes the two objectives, and systematically evaluate it across zero-shot classification, retrieval, and probing tasks. CLIPred outperforms CLIP-only and I-JEPA-only training as well as sequential application of the two objectives, and offers better zero-shot transfer than DINOv2+CLIP at lower training cost, though with trade-offs in probing performance. Our experiments further examine the effects of loss weighting, the amount of data used by each objective, and batch size on our framework. We also analyze architectural design choices and the semantics of the patch embeddings produced by CLIPred. This work provides the first comprehensive assessment of combining I-JEPA and CLIP, highlighting both the benefits and limitations of the framework and offering recommendations on when and how to apply it.
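To make the joint optimization described in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes the combined objective is a weighted sum of an I-JEPA-style predictive loss and a CLIP-style contrastive loss, with a hypothetical weighting coefficient `lam` standing in for the loss-weighting hyperparameter the study examines. All function and variable names are illustrative.

```python
# Hypothetical sketch of a joint I-JEPA + CLIP objective (not the paper's code).
import torch
import torch.nn.functional as F

def ijepa_loss(pred_patches, target_patches):
    # Predictive loss: regress predicted patch embeddings onto
    # stop-gradient target-encoder embeddings of the masked regions.
    return F.smooth_l1_loss(pred_patches, target_patches.detach())

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of image-text pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def joint_loss(pred_patches, target_patches, image_emb, text_emb, lam=0.5):
    # Weighted combination of the two objectives; lam is an assumed
    # loss-weighting hyperparameter, its value here purely illustrative.
    return lam * ijepa_loss(pred_patches, target_patches) + \
           (1.0 - lam) * clip_loss(image_emb, text_emb)
```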
Submission Number: 61