Keywords: CLIP; multimodal learning; cross-modal alignment; temperature controller; few-shot learning; loss optimization
Abstract: Multimodal models such as CLIP align images and text in a unified feature space, enabling cross-modal tasks such as retrieval, captioning, and classification. Despite strong representation learning and zero-shot generalization, CLIP struggles in complex or few-shot scenarios, where occlusion, low light, or multiple objects reduce feature discrimination and semantic alignment. To address this, we introduce a learnable temperature controller in the image encoder to enhance feature separation, jointly optimize it with ID, MLM, and SDM losses, and further propose a semantic similarity–weighted triplet loss to improve cross-modal understanding under challenging conditions.
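To make the two named components concrete, the sketch below is a minimal PyTorch illustration of (1) a learnable temperature that scales image–text similarity logits, CLIP-style, and (2) a triplet loss whose per-sample contribution is re-weighted by a semantic-similarity score. The exact formulations, the `sem_sim` weighting scheme, the margin value, and the integration with the ID/MLM/SDM losses are assumptions for illustration, not the paper's definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableTemperature(nn.Module):
    """Scales cosine-similarity logits by a learnable temperature (assumed CLIP-style)."""

    def __init__(self, init_temp: float = 0.07):
        super().__init__()
        # Store the log-temperature so the optimized value stays positive.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(init_temp)))

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        image_feats = F.normalize(image_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        return image_feats @ text_feats.t() / self.log_temp.exp()


def similarity_weighted_triplet_loss(anchor, positive, negative, sem_sim, margin: float = 0.2):
    """Triplet loss where each sample's hinge term is weighted by a semantic-similarity
    score `sem_sim` in [0, 1] (hypothetical weighting; the paper's exact form may differ)."""
    pos_dist = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    neg_dist = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    per_sample = F.relu(pos_dist - neg_dist + margin)
    return (sem_sim * per_sample).mean()


# Toy usage with random features (batch of 8, 512-dim embeddings).
if __name__ == "__main__":
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    temp = LearnableTemperature()
    logits = temp(img, txt)                  # (8, 8) image-text similarity logits
    neg = torch.roll(txt, shifts=1, dims=0)  # naive in-batch negatives
    weights = torch.rand(8)                  # stand-in semantic-similarity weights
    loss = similarity_weighted_triplet_loss(img, txt, neg, weights)
    print(logits.shape, loss.item())
```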
Submission Number: 7