Theoretical Analysis of Contrastive Learning in Vision-Language Model Pretraining: The Role of Synthetic Text Captions for Feature Alignment
TL;DR: This paper provides the first theoretical analysis of VLM training dynamics with a nonlinear neural network and offers the first theoretical support for using synthetic text captions to enhance pre-training performance.
Abstract: Vision-language models (VLMs) pre-trained on web-sourced image-text pairs have achieved remarkable success in multimodal tasks. Incorporating synthetic text captions during pre-training has been shown to enhance image-text alignment, significantly improving model performance.
Despite these empirical advances, the theoretical understanding of how VLMs align modalities, extract features, and achieve zero-shot capabilities remains limited. This paper provides the first theoretical analysis of VLM training dynamics with nonlinear activation functions and offers the first theoretical support for the use of synthetic text captions in enhancing pre-training performance. Specifically, we analyze the impact of misaligned image-text pairs, showing that neurons trained on noisy data learn mixtures of true and spurious features, which degrades generalization. In contrast, text generated by image-grounded text decoders reduces spurious correlations and improves model accuracy, enabling success in zero-shot multi-class classification where models trained on raw text fail.
While our analysis uses simplified models for theoretical tractability, our findings are validated through experiments on state-of-the-art VLMs, such as BLIP.
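The abstract does not spell out the training objective or the zero-shot protocol. The following is a minimal sketch, assuming a standard CLIP-style symmetric contrastive loss and similarity-based zero-shot classification, with random toy embeddings standing in for the paper's nonlinear encoders; all function names, dimensions, and the temperature value are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumptions: PyTorch, CLIP-style symmetric InfoNCE loss,
# toy embeddings in place of the paper's nonlinear image/text encoders).

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss over a batch of paired image/text embeddings.

    Misaligned (noisy) pairs enter this loss exactly like clean ones,
    which is how spurious image-text correlations can be picked up.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # and each caption to its image
    return (loss_i2t + loss_t2i) / 2

def zero_shot_classify(image_emb, class_text_emb):
    """Zero-shot multi-class classification: assign each image the class
    whose text (prompt) embedding is most similar to the image embedding."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)  # (B,) predicted class indices

if __name__ == "__main__":
    torch.manual_seed(0)
    B, d, C = 8, 32, 5                                       # batch size, embedding dim, #classes
    img_emb = torch.randn(B, d)
    txt_emb = torch.randn(B, d)
    print("contrastive loss:", symmetric_contrastive_loss(img_emb, txt_emb).item())
    print("zero-shot preds:", zero_shot_classify(img_emb, torch.randn(C, d)))
```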
Primary Area: Deep Learning->Theory
Keywords: Vision-language model, Contrastive learning, Multimodal learning, Learning theory, Feature learning
Submission Number: 8566