VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

ICLR 2026 Conference Submission 7990 Authors

16 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: JEPA, VLM, video-language, efficiency
TL;DR: We introduce a vision-language model based on JEPA that achieves competitive scores while being more efficient during training and inference.
Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model can focus on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while using 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate the embeddings predicted by VL-JEPA into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by approximately 2.85× while maintaining performance similar to dense, non-adaptive uniform decoding. Beyond generation, the embedding-space formulation naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architectural modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.
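To make the two ideas in the abstract concrete (prediction in an embedding space rather than a token space, and selective invocation of a text decoder), the sketch below is a minimal, hypothetical PyTorch rendering. The module names, the cosine-based objective, and the similarity threshold are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLJEPASketch(nn.Module):
    """Hypothetical sketch: map pooled visual features to a predicted
    text embedding instead of autoregressively generating tokens."""

    def __init__(self, vis_dim: int = 1024, txt_dim: int = 768):
        super().__init__()
        # Illustrative predictor head; the paper's modules may differ.
        self.predictor = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, vis_dim) -> predicted text embeddings: (B, txt_dim)
        return self.predictor(vis_feats)


def embedding_loss(pred_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """Embedding-space objective: pull predictions toward the target text
    embeddings (a cosine loss stands in for whatever loss the paper uses)."""
    return 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()


def selective_decode(pred_embs, prev_emb, decode_fn, threshold: float = 0.9):
    """Invoke the (lightweight) text decoder only when a prediction differs
    enough from the last decoded embedding, skipping redundant decoder calls."""
    outputs, last = [], prev_emb
    for emb in pred_embs:  # e.g. one predicted embedding per timestep
        if F.cosine_similarity(emb, last, dim=-1) < threshold:
            outputs.append(decode_fn(emb))  # decoder call happens only here
            last = emb
    return outputs
```

Because supervision and inference both live in the embedding space, the same predicted embeddings can be compared against class-name or caption embeddings for open-vocabulary classification and retrieval, with the decoder needed only when free-form text output is requested.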
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7990