Abstract: Few-shot object detection (FSOD) is the task of detecting novel categories from only a limited number of annotated samples. Due to this data scarcity, networks struggle to learn robust features that effectively represent object categories. Most current FSOD methods rely on a single modality in their network design, overlooking the high-level semantic relationships between categories and their textual descriptions. We propose a method named Text Generation and Multi-Modal Knowledge Transfer for Few-Shot Object Detection (MMKT). It employs a Text Prompt Descriptors Generator (TPDG) to generate prompts tailored to specific categories, reducing the reliance on particular external knowledge sources observed in previous methods. To relate text-based descriptions and visual features in a shared space, we develop an Image–Text Matching module that establishes correlations between textual and visual features. Experiments demonstrate the effectiveness of the proposed MMKT method.
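The image–text matching idea mentioned above can be sketched as projecting both modalities into a shared space and scoring them by cosine similarity. The following is a minimal illustrative sketch, not the paper's actual module: all dimensions, the random linear projections (stand-ins for learned projection heads), and the softmax temperature are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumed, not from the paper).
D_IMG, D_TXT, D_SHARED = 256, 512, 128
N_REGIONS, N_PROMPTS = 4, 3  # visual region features, category text prompts

# Random projections stand in for learned projection heads.
W_img = rng.normal(size=(D_IMG, D_SHARED)) / np.sqrt(D_IMG)
W_txt = rng.normal(size=(D_TXT, D_SHARED)) / np.sqrt(D_TXT)

img_feats = rng.normal(size=(N_REGIONS, D_IMG))
txt_feats = rng.normal(size=(N_PROMPTS, D_TXT))

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Map both modalities into the shared space and L2-normalize.
img_emb = l2_normalize(img_feats @ W_img)
txt_emb = l2_normalize(txt_feats @ W_txt)

# Matching matrix: rows = image regions, columns = text prompts.
sim = img_emb @ txt_emb.T  # entries lie in [-1, 1]

# Softmax over prompts gives each region a distribution over categories.
logits = sim / 0.1  # temperature 0.1 (assumed)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
```

In CLIP-style training, a contrastive loss would pull each region's embedding toward its matching prompt and push it away from the others; here only the forward matching step is shown.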