Abstract: Traditional deep learning models often struggle in few-shot learning scenarios, where limited labeled data is available.
While the Contrastive Language-Image Pre-training (CLIP) model demonstrates impressive zero-shot capabilities, its performance in few-shot scenarios remains limited. Existing methods primarily aim to better exploit the small labeled dataset, which offers limited room for improvement. To overcome this limitation, we introduce a novel framework, SSAT-Adapter, which leverages CLIP's language understanding to generate informative auxiliary tasks and thereby improve CLIP's performance and adaptability in few-shot settings. Specifically, we use CLIP's language understanding to create decision-boundary-focused image latents. These latents form auxiliary tasks: inter-class instances bridge CLIP's pre-trained knowledge with the provided examples, while intra-class instances subtly expand the representation of the target classes. A self-paced training regime, progressing from easier to more complex tasks, further promotes robust learning. Experiments show that our framework outperforms the state-of-the-art online few-shot learning method by an average of 2.2\% across eleven image classification datasets. Further ablation studies on various tasks demonstrate the effectiveness of our approach in enhancing CLIP's adaptability for few-shot image classification.
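For illustration only, the sketch below (not the authors' implementation; all function names and the interpolation scheme are hypothetical assumptions) shows one way decision-boundary-focused latents could be built from CLIP class text embeddings and ordered into an easy-to-hard, self-paced curriculum.

```python
# Hypothetical sketch: auxiliary latents near pairwise class decision
# boundaries, plus a self-paced (easy-to-hard) ordering. Not the SSAT-Adapter code.
import torch
import torch.nn.functional as F


def boundary_latents(text_emb: torch.Tensor, alphas=(0.2, 0.35, 0.5)):
    """text_emb: [C, D] L2-normalised class text embeddings.

    Returns (latents [M, D], labels [M], difficulty [M]); latents with
    alpha closer to 0.5 lie nearer the pairwise decision boundary (harder).
    """
    C = text_emb.size(0)
    latents, labels, difficulty = [], [], []
    for i in range(C):
        for j in range(C):
            if i == j:
                continue
            for a in alphas:
                # Interpolate from class i toward class j, then re-normalise.
                z = F.normalize((1 - a) * text_emb[i] + a * text_emb[j], dim=0)
                latents.append(z)
                labels.append(i)          # still labelled as the source class
                difficulty.append(a)      # larger alpha -> closer to boundary
    return torch.stack(latents), torch.tensor(labels), torch.tensor(difficulty)


def curriculum_order(difficulty: torch.Tensor) -> torch.Tensor:
    """Self-paced schedule: present easier (lower-difficulty) latents first."""
    return torch.argsort(difficulty)


if __name__ == "__main__":
    # Dummy embeddings standing in for CLIP text features (C = 5 classes, D = 512).
    text_emb = F.normalize(torch.randn(5, 512), dim=-1)
    z, y, d = boundary_latents(text_emb)
    order = curriculum_order(d)
    print(z[order].shape, y[order][:10])
```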
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: Our work advances multimedia modeling by addressing few-shot image classification, a scenario where traditional unimodal deep learning models often struggle. The proposed SSAT-Adapter framework leverages the pre-trained knowledge of the Contrastive Language-Image Pre-training (CLIP) model, a multimodal foundation model. By using CLIP's language understanding to dynamically generate auxiliary tasks of varying complexity, the framework refines decision boundaries and expands class representations, allowing it to learn more effectively from limited visual examples. This research directly contributes to multimodal modeling in constrained visual search and recommendation scenarios, where success depends on interpreting the relationships between images and their associated textual descriptions.
Submission Number: 5120