Abstract: Humans can learn quickly from only a handful of labeled samples, a capability that current machine learning systems still lack. Unsupervised Few-Shot Learning (U-FSL) seeks to close this gap by reducing reliance on annotated data during pretraining. In this work, we first quantitatively assess the impact of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings show that MIM is limited in discriminative ability while CL is limited in generalization ability, and that these limitations underlie their underperformance in U-FSL. To reconcile this trade-off between generalization and discriminability in unsupervised pretraining, we introduce a new paradigm, Masked Image Contrastive Modeling (MICM). MICM combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, substantially improving downstream few-shot learning inference. Extensive experiments confirm that MICM improves both the generalization and discrimination abilities of the learned features, and our quantitative evaluations show that a two-stage U-FSL framework built on MICM markedly outperforms leading baselines. The source code is provided in the supplementary materials for reproducibility.
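To make the combination described above concrete, here is a minimal, hypothetical sketch (not the authors' exact objective) of how a masked-patch reconstruction term (MIM) and an InfoNCE contrastive term (CL) could be optimized jointly during pretraining; the function names, tensor shapes, and the weighting factor `lam` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    # MIM term: mean squared error computed on masked patches only.
    # pred_patches, target_patches: (B, N, D); mask: (B, N) with 1 = masked.
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def info_nce_loss(z1, z2, temperature=0.1):
    # CL term: InfoNCE between global embeddings of two augmented views.
    # z1, z2: (B, D) projection outputs.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def combined_pretraining_loss(pred, target, mask, z1, z2, lam=1.0):
    # Joint objective: generalized features from the MIM term plus
    # discriminative features from the CL term, weighted by lam.
    return masked_reconstruction_loss(pred, target, mask) + lam * info_nce_loss(z1, z2)
```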
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper contributes to multimedia/multimodal processing by addressing Unsupervised Few-Shot Learning (U-FSL), a critical task in multimedia scenarios where collecting large numbers of labeled training samples is often difficult and expensive. By merging contrastive learning with masked image modeling, the proposed MICM paradigm improves feature discrimination and generalization, allowing the model to perform well with limited data and across diverse domains. This adaptability makes MICM highly effective for multimedia applications where rapid adaptation to new and varied data types is crucial.
Supplementary Material: zip
Submission Number: 1367