ViSeNet: A Visual-Semantic Fusion Network for Enhancing Few-Shot Classification

Published: 2025 · Last Modified: 18 Sept 2025 · ICCE 2025 · CC BY-SA 4.0
Abstract: Few-shot learning addresses the challenge of recognizing novel classes from only a few labeled examples, a critical issue in limited-data scenarios. This paper proposes a multi-modal Visual and Semantic Network (ViSeNet), which enhances few-shot classification by mining linguistic features and fusing weighted visual and semantic vectors. ViSeNet generates enriched class prototypes by leveraging both visual and semantic information. The linguistic feature mining process uses image-to-text models to produce class-specific textual descriptions, which are then transformed into representative semantic embeddings. For visual feature extraction, we employ a pretrained Swin Transformer, while the text embeddings are combined per class using weighting factors. By fusing these modalities, ViSeNet constructs class prototypes that capture both visual and semantic characteristics. Experimental results on benchmark datasets, particularly CIFAR-FS, demonstrate that ViSeNet achieves performance comparable to state-of-the-art approaches.
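The prototype-fusion idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature arrays stand in for Swin Transformer outputs and caption embeddings, and the scalar weight `lam` is an assumed fusion factor (the paper's exact weighting scheme may differ, e.g. learned per-class weights).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical episode dimensions -- not specified in the abstract.
n_way, k_shot, dim = 5, 1, 64

# Placeholder features standing in for Swin Transformer outputs on
# support images and sentence embeddings of generated class captions.
visual_support = rng.normal(size=(n_way, k_shot, dim))
text_embed = rng.normal(size=(n_way, dim))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fused_prototypes(visual_support, text_embed, lam=0.7):
    """Build class prototypes from a weighted sum of the per-class mean
    visual feature and the class semantic embedding. `lam` is an assumed
    fusion weight, used here only for illustration."""
    visual_proto = visual_support.mean(axis=1)  # (n_way, dim)
    proto = lam * l2_normalize(visual_proto) + (1 - lam) * l2_normalize(text_embed)
    return l2_normalize(proto)

def classify(queries, prototypes):
    """Nearest-prototype classification by cosine similarity."""
    sims = l2_normalize(queries) @ prototypes.T  # (n_query, n_way)
    return np.argmax(sims, axis=-1)

protos = fused_prototypes(visual_support, text_embed)
# Synthetic queries: support features perturbed by small noise.
queries = visual_support[:, 0, :] + 0.1 * rng.normal(size=(n_way, dim))
preds = classify(queries, protos)
```

The design choice here is the standard prototypical-network recipe (mean support feature per class, nearest-prototype matching), augmented with a semantic term; both vectors are L2-normalized before mixing so the weight `lam` balances modalities on a common scale.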