TempCLIP: Adaptive Temperature Control for Robust Multimodal Alignment

ACM SGA 2025 Workshop TriFusion

21 Sept 2025 (modified: 23 Sept 2025) · CC BY 4.0
Keywords: CLIP; multimodal learning; cross-modal alignment; temperature controller; few-shot learning; loss optimization
Abstract: Multimodal models such as CLIP align images and texts in a unified feature space, enabling cross-modal tasks like retrieval, captioning, and classification. Despite strong representation and zero-shot gen- eralization, CLIP faces challenges in complex or few-shot scenarios, where occlusion, low light, or multiple objects reduce feature dis- crimination and semantic alignment. To address this, we introduce a learnable temperature controller in the image encoder to enhance feature separation, jointly optimize with ID, MLM, and SDM losses, and further propose a semantic similarity–weighted triplet loss to improve cross-modal understanding under challenging conditions.
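The abstract's semantic similarity–weighted triplet loss with temperature scaling can be illustrated with a minimal sketch. This is an assumed formulation for illustration only (the function names, the margin value, and the choice to weight the negative term by its cosine similarity to the anchor are all hypothetical, not taken from the paper); it uses plain Python lists rather than a tensor library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def weighted_triplet_loss(anchor, positive, negative,
                          temperature=0.07, margin=0.2):
    """Hypothetical similarity-weighted triplet loss sketch.

    Similarities are sharpened by a temperature (learnable in the paper,
    fixed here), and the negative term is weighted by how semantically
    close the negative is to the anchor, so hard negatives are
    penalized more strongly.
    """
    sim_pos = cosine(anchor, positive) / temperature
    sim_neg = cosine(anchor, negative) / temperature
    # Weight for the negative term: clip to [0, 1] so dissimilar
    # negatives contribute nothing (assumed design choice).
    w = max(cosine(anchor, negative), 0.0)
    return max(0.0, margin + w * sim_neg - sim_pos)
```

For example, when the anchor matches the positive exactly and is orthogonal to the negative, `weighted_triplet_loss([1, 0], [1, 0], [0, 1])` returns `0.0`, while swapping positive and negative yields a large positive loss.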
Submission Number: 7