Improved Vision-Language Alignment via Text-Conditioned Image Embeddings using Sparse Autoencoders

Published: 24 Apr 2026, Last Modified: 01 Jun 2026. VisCon 2026 Poster. CC BY 4.0
Keywords: Vision-language Alignment, CLIP, Modality Gap, Sparse Autoencoders
TL;DR: We propose a method to improve image-text alignment by learning to edit image embeddings conditioned on text, using sparse autoencoders.
Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, they often suffer from a modality gap in which image and text embeddings are poorly aligned, which hurts downstream performance. Recent work has shown that the modality gap can be attributed to an information imbalance between the two modalities. In this work, we propose LMask-Edit, a framework that explicitly models the information imbalance and addresses it by editing image embeddings conditioned on text. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on the text conditioning. Using a controlled setup, we show that LMask-Edit is effective at conditioning and improves cross-modal alignment. By applying LMask-Edit to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse (MS COCO, Flickr) and fine-grained (IIW, DOCCI) benchmarks, as well as robust retrieval on the RoCOCO benchmark, demonstrating its promise for improving learned representations.
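The pipeline described in the abstract (disentangle the image embedding with a sparse autoencoder, mask its latents according to the text, then reconstruct) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the random weights, the sigmoid mask, and the function names `sae_encode`, `sae_decode`, `text_mask`, and `edit_image_embedding` are all assumptions made for illustration, and the abstract does not specify the actual architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 32  # embedding dim and SAE latent dim (illustrative sizes, not from the paper)

# Hypothetical, randomly initialised parameters; in practice these would be learned.
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)
W_mask = rng.normal(scale=0.1, size=(d, k))


def sae_encode(x):
    """Map embeddings to non-negative SAE latents with a ReLU."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def sae_decode(z):
    """Linearly reconstruct an embedding from SAE latents."""
    return z @ W_dec + b_dec


def text_mask(t):
    """Assumed masking module: a per-latent soft gate in (0, 1) from the text embedding."""
    return 1.0 / (1.0 + np.exp(-(t @ W_mask)))


def edit_image_embedding(img, txt):
    """Gate the image's SAE latents by the text mask, decode, and L2-normalise."""
    z = sae_encode(img)
    m = text_mask(txt)
    edited = sae_decode(z * m)
    norm = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited / np.maximum(norm, 1e-12)
```

Under these assumptions, `edit_image_embedding(img, txt)` returns a unit-norm embedding for each image-text pair, which could then be compared to the text embedding by cosine similarity for retrieval.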
Submission Number: 5