VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Published: 26 Jan 2026 · Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: JEPA, VLM, video-language, efficiency
TL;DR: We introduce a vision-language model based on JEPA that achieves competitive scores while being more efficient during training and inference.
Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by ~2.85× while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves comparable performance to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets—GQA, TallyQA, POPE, and POPEv2—despite having only 1.6B parameters.
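The selective decoding described in the abstract — invoking the text decoder only when needed — can be illustrated with a minimal sketch. This is an illustrative assumption, not the paper's exact mechanism: here the decoder is skipped whenever the newly predicted embedding stays close (by cosine similarity) to the last embedding that was actually decoded; the function names and threshold rule are hypothetical.

```python
# Hypothetical sketch of selective decoding over a stream of predicted
# embeddings. The change-detection rule (cosine similarity against the last
# decoded embedding) is an illustrative assumption, not VL-JEPA's method.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def selective_decode(embeddings, decode_fn, threshold=0.9):
    """Call the (expensive) text decoder only when the predicted embedding
    drifts far enough from the last decoded one; otherwise reuse the
    previously decoded text. Returns one text per input embedding."""
    outputs, last_decoded = [], None
    for e in embeddings:
        if last_decoded is None or cosine(e, last_decoded) < threshold:
            outputs.append(decode_fn(e))   # expensive decoder invocation
            last_decoded = e
        else:
            outputs.append(outputs[-1])    # reuse cached caption
    return outputs
```

With a stream of near-duplicate embeddings, most decoder calls are skipped, which is the effect behind the reported ~2.85× reduction in decoding operations.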
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7990