Align-VL: Can Being Modest Help in the Alignment of Vision-Language Models?

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Vision-Language Models, Multimodal Alignment, Embedding Smoothing
TL;DR: Align-VL is a multimodal alignment enhancement method that mitigates model overconfidence and outperforms existing models on image-text retrieval while requiring substantially less data and compute.
Abstract: Multimodal alignment aims to learn a shared latent space between inputs from different modalities, establishing connections across them. A prime example is Vision-Language Models (VLMs) such as CLIP, which benefit from extensive image-text pre-training and excel at image recognition; these models are emblematic of successful multimodal alignment. Subsequent work has aligned multimodal data on limited datasets using feature-mixing enhancement methods. However, these models face a significant challenge: ambiguous samples (partially matched or completely unmatched) in datasets of weakly associated, low-quality image-text pairs cause models to become overconfident during training and confused at inference, ultimately degrading performance. Current contrastive learning methods, which rely on a single positive pair per sample, exacerbate this issue by encouraging overconfidence whenever the model encounters such ambiguous samples. To overcome these challenges, we develop Align-VL, a multimodal alignment enhancement method that operates on the latent spaces of pre-trained unimodal encoders. It adjusts the matching degree of the data and moderates model overconfidence, promoting more appropriate and effective alignments. Align-VL incorporates Random Perturbation and Embedding Smoothing strategies to improve the robustness of input features and reduce overconfidence, strengthening the model's ability to handle uncertainty and generalize to new data. In our experiments, Align-VL outperforms existing state-of-the-art (SoTA) methods on image-text retrieval tasks, demonstrating its effectiveness. Align-VL also offers substantial reductions in training time and data requirements compared to methods such as CLIP, using far fewer GPU days and image-text pairs. Code will be publicly available.
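To make the described ingredients concrete, below is a minimal, hypothetical PyTorch sketch of a contrastive objective combining the two named strategies, assuming Random Perturbation amounts to adding small Gaussian noise to the unimodal embeddings and Embedding Smoothing amounts to softening the one-hot contrastive targets. These assumptions, and all function names, are illustrative only and are not taken from the paper.

```python
# Hypothetical sketch: smoothed contrastive loss over perturbed embeddings.
# Assumed forms of "Random Perturbation" and "Embedding Smoothing"; not the authors' code.
import torch
import torch.nn.functional as F

def perturb(z, sigma=0.01):
    # Assumed Random Perturbation: add small Gaussian noise to embeddings.
    return z + sigma * torch.randn_like(z)

def smoothed_clip_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    # Normalize the (perturbed) unimodal embeddings onto the unit sphere.
    img = F.normalize(perturb(img_emb), dim=-1)
    txt = F.normalize(perturb(txt_emb), dim=-1)

    logits = img @ txt.t() / temperature  # pairwise image-text similarities
    n = logits.size(0)

    # Assumed Embedding Smoothing: soften the one-hot targets so the model is
    # not pushed toward full confidence on possibly ambiguous pairs.
    targets = torch.full_like(logits, smoothing / (n - 1))
    targets.fill_diagonal_(1.0 - smoothing)

    loss_i2t = torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
    loss_t2i = torch.sum(-targets.t() * F.log_softmax(logits.t(), dim=-1), dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with frozen unimodal encoders producing 512-d embeddings.
img_emb = torch.randn(8, 512)
txt_emb = torch.randn(8, 512)
print(smoothed_clip_loss(img_emb, txt_emb).item())
```

Operating on precomputed embeddings from frozen unimodal encoders, as in this sketch, is consistent with the abstract's claim of reduced training time and data requirements, since only a lightweight alignment stage is trained.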
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10789