Keywords: Contrastive Learning, Multi-View Learning, Multimodal Learning, Vision-Language Representation Degeneration
TL;DR: We propose QUEST, a novel framework with quaternion objectives and constraints to capture both shared and unique information.
Abstract: Multimodal contrastive learning (MCL) has recently demonstrated significant success across various tasks. However, existing MCL treats all negative samples equally and ignores their potential semantic association with positive samples, which limits the model's ability to achieve fine-grained alignment. In multi-view scenarios, MCL tends to prioritize shared information while neglecting modality-specific unique information across different views, leading to feature suppression and suboptimal performance in downstream tasks. To address these limitations, we propose a novel contrastive framework named *QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization*. In the QUEST framework, we propose quaternion contrastive objectives and orthogonal constraints to extract sufficient unique information. Meanwhile, a shared-information-guided penalization is introduced to ensure that shared information does not excessively influence the optimization of unique information. Our method leverages quaternion vector spaces to simultaneously optimize shared and unique information. Experiments on multiple datasets show that our method achieves superior performance on multimodal contrastive learning benchmarks. On public benchmarks, our approach achieves state-of-the-art performance, and on synthetic shortcut datasets, we outperform existing baseline methods by an average of 97.95\% on the CLIP model.
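To make the abstract's components concrete, below is a minimal, hypothetical sketch of how a quadruple contrastive objective with an orthogonality constraint and a shared-information-guided penalization could be assembled. This is not the paper's actual QUEST implementation; all function names, the use of cosine-similarity orthogonality, and the weighting scheme (`lambda_orth`, `lambda_pen`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    # Standard InfoNCE over a batch of paired embeddings.
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def quest_style_loss(img_shared, txt_shared, img_unique, txt_unique,
                     lambda_orth=1.0, lambda_pen=0.1):
    # Quadruple contrastive terms (hypothetical): align shared embeddings
    # across modalities and, separately, the unique embeddings.
    loss_shared = info_nce(img_shared, txt_shared)
    loss_unique = info_nce(img_unique, txt_unique)

    # Orthogonality constraint (illustrative): discourage overlap between each
    # modality's shared and unique representations via squared cosine similarity.
    orth = (F.cosine_similarity(img_shared, img_unique, dim=-1) ** 2).mean() \
         + (F.cosine_similarity(txt_shared, txt_unique, dim=-1) ** 2).mean()

    # Shared-information-guided penalization (one possible reading): scale a
    # penalty on the unique-branch loss by how strongly the shared branches
    # already agree, so shared information does not dominate its optimization.
    with torch.no_grad():
        shared_agreement = F.cosine_similarity(img_shared, txt_shared, dim=-1).mean()
    penalty = lambda_pen * shared_agreement * loss_unique

    return loss_shared + loss_unique + lambda_orth * orth + penalty
```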
Supplementary Material: zip
Primary Area: Deep learning architectures
Submission Number: 5373