Abstract: Recognizing visual patterns in the low-data regime requires deep neural networks to glean generalized representations from limited training samples. In this letter, we propose a novel few-shot classification method, ProDFSL, which leverages multimodal knowledge and an attention mechanism. Inspired by recent advances in large language models and the great potential they have shown across a wide range of downstream tasks, we tailor them to benefit the remote sensing community. We use ChatGPT to produce class-specific textual inputs that supply CLIP with rich semantic information. To promote the adaptation of CLIP to the remote sensing domain, we introduce a cross-modal knowledge generation module, which dynamically generates a group of soft prompts conditioned on the few-shot visual samples and then uses a shallow Transformer to model the dependencies within the language sequences. By fusing the semantic information with the few-shot visual samples, we build representative class prototypes that are conducive to both inductive and transductive inference. In extensive experiments on standard benchmarks, our ProDFSL consistently outperforms the state of the art in few-shot learning (FSL).
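To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two ideas named in the abstract: generating a group of soft prompts conditioned on few-shot visual features, refining them with a shallow Transformer, and fusing the resulting semantic features with visual prototypes. All module names, dimensions, and the fusion weight are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalPromptGenerator(nn.Module):
    """Sketch of a cross-modal knowledge generation module: project pooled
    visual features into soft prompt tokens and model their dependencies
    with a shallow Transformer (hypothetical names and sizes)."""

    def __init__(self, visual_dim=512, text_dim=512, num_prompts=4,
                 nhead=8, num_layers=2):
        super().__init__()
        self.num_prompts = num_prompts
        # Map each class's visual feature to a group of soft prompt tokens.
        self.prompt_proj = nn.Linear(visual_dim, num_prompts * text_dim)
        # Shallow Transformer encoder over the prompt token sequence.
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, visual_feats):
        # visual_feats: (num_classes, visual_dim), e.g. mean support embedding per class
        b = visual_feats.size(0)
        prompts = self.prompt_proj(visual_feats).view(b, self.num_prompts, -1)
        return self.encoder(prompts)  # (num_classes, num_prompts, text_dim)


def fuse_prototypes(visual_protos, semantic_feats, alpha=0.5):
    """Convex combination of visual prototypes and prompt-derived semantic
    features; alpha is a hypothetical fusion weight."""
    visual_protos = nn.functional.normalize(visual_protos, dim=-1)
    semantic_feats = nn.functional.normalize(semantic_feats, dim=-1)
    return alpha * visual_protos + (1 - alpha) * semantic_feats


# Toy usage: a 5-way task with 512-d CLIP-like embeddings (assumed sizes).
gen = CrossModalPromptGenerator()
support_protos = torch.randn(5, 512)        # per-class visual prototypes
prompt_tokens = gen(support_protos)         # (5, 4, 512) soft prompt tokens
semantic = prompt_tokens.mean(dim=1)        # pool prompt tokens per class
class_protos = fuse_prototypes(support_protos, semantic)
```

The fused prototypes would then serve as class representatives for nearest-prototype classification, whether inference is inductive or transductive.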