Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models
TL;DR: Kernel-based Unsupervised Alignment of CLIP and DINOv2 Embeddings
Abstract: Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance. The code and models are available at https://github.com/peterant330/KUEA.
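To make the idea of a batch-friendly kernel alignment objective concrete, here is a minimal sketch, assuming a CKA-style loss that matches the centered Gram (kernel) matrix of CLIP visual embeddings to that of frozen DINOv2 embeddings within each mini-batch. The function names and the specific loss form are illustrative assumptions, not the paper's exact objective; see the repository linked above for the actual implementation.

```python
# Illustrative sketch (assumption): batch-wise kernel alignment in the spirit of
# linear CKA. Only the CLIP visual encoder would be updated; DINOv2 and the CLIP
# text encoder stay frozen, which is how text compatibility is preserved.
import torch
import torch.nn.functional as F

def centered_gram(x: torch.Tensor) -> torch.Tensor:
    """Centered linear kernel (Gram) matrix for a batch of embeddings [B, D]."""
    x = F.normalize(x, dim=-1)                      # unit-norm features
    gram = x @ x.t()                                # [B, B] pairwise similarities
    n = gram.size(0)
    h = torch.eye(n, device=x.device) - 1.0 / n     # centering matrix
    return h @ gram @ h

def kernel_alignment_loss(clip_feats: torch.Tensor,
                          dino_feats: torch.Tensor) -> torch.Tensor:
    """1 - CKA between the two batch kernels; approaches 0 as kernels align."""
    k = centered_gram(clip_feats)                   # trainable CLIP visual features
    l = centered_gram(dino_feats)                   # frozen DINOv2 features
    cka = (k * l).sum() / (k.norm() * l.norm() + 1e-8)
    return 1.0 - cka

# Usage sketch (hypothetical encoders):
# clip_feats = clip_visual(images)                  # requires grad
# with torch.no_grad():
#     dino_feats = dinov2(images)                   # frozen teacher
# loss = kernel_alignment_loss(clip_feats, dino_feats)
# loss.backward()
```

Because the loss is computed from per-batch Gram matrices, it can be minimized with ordinary stochastic gradient methods, consistent with the abstract's note that the alignment objective is designed for efficient stochastic optimization.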
Lay Summary: Vision-language models like CLIP are effective at linking images and text but struggle with fine-grained image details, which limits their performance in complex tasks. In contrast, image-focused models like DINOv2 excel at capturing these details.
This study proposes a method to improve CLIP by aligning its visual representations with those of DINOv2 while preserving compatibility with its text embeddings. After this fine-tuning, the visual encoder recognizes objects, reasons about spatial relationships, and localizes fine details more accurately.
When the improved visual encoder is integrated into larger multi-modal systems, these systems show better performance on tasks requiring detailed visual understanding. This approach addresses CLIP’s key limitations, making vision-language models more effective and versatile.
Link To Code: https://github.com/peterant330/KUEA
Primary Area: Deep Learning->Other Representation Learning
Keywords: vision-language model, representation learning, CLIP
Submission Number: 1059