Abstract: The emergence of vision-language foundation models has enabled the integration of textual information into vision-based applications. However, in few-shot classification and segmentation (FS-CS), this potential remains underutilised. Self-supervised vision models are commonly employed, particularly in weakly-supervised scenarios, to generate pseudo-segmentation masks, as ground-truth masks are typically unavailable and only class labels are provided. Despite their success, such models struggle to capture accurate semantics compared to vision-language models. To address this limitation, we propose a novel FS-CS approach that leverages the rich semantic alignment of vision-language models to generate more precise pseudo ground-truth masks. While current vision-language models excel at global visual-text alignment, they struggle with finer, patch-level alignment, which is crucial for detailed segmentation. To overcome this, we introduce a method that enhances patch-level alignment without requiring additional training. In addition, existing FS-CS frameworks typically lack multi-scale information, limiting their ability to capture fine and coarse features simultaneously. To address this, we incorporate a module based on atrous convolutions that injects multi-scale information into the feature maps. Together, these contributions, text-enhanced pseudo-mask generation and improved multi-scale feature representation, significantly boost the performance of our model in weakly-supervised settings, surpassing state-of-the-art methods and demonstrating the importance of integrating multi-modal information for robust FS-CS solutions.
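The abstract only names the multi-scale component as "a module based on atrous convolutions"; the sketch below is a minimal ASPP-style illustration of that idea, not the authors' exact design. The class name, channel width, dilation rates, and residual fusion are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class MultiScaleAtrousBlock(nn.Module):
    """Illustrative ASPP-style block: parallel atrous (dilated) convolutions
    at several rates, fused back to the input channel width."""

    def __init__(self, channels: int, rates=(1, 6, 12, 18)):
        super().__init__()
        # One 3x3 convolution per dilation rate; padding=rate preserves spatial size.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )
        # 1x1 convolution to fuse the concatenated multi-scale responses.
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Residual connection keeps the original features alongside the injected context.
        return self.fuse(multi_scale) + x


# Usage: inject multi-scale context into a feature map of shape (B, C, H, W).
feats = torch.randn(2, 256, 32, 32)
block = MultiScaleAtrousBlock(channels=256)
out = block(feats)  # same shape as the input: (2, 256, 32, 32)
```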