Abstract: In recent years, the Few-Shot Fine-Grained Image Classification (FS-FGIC) problem has gained widespread attention. A number of effective methods have been proposed that focus on extracting discriminative information from high-level features within a single episode/task. However, this is insufficient for addressing the cross-task challenges of FS-FGIC, which manifest in two aspects. On the one hand, from the perspective of the Fine-Grained Image Classification (FGIC) task, the model needs to be supplemented with mid-level features containing rich fine-grained information. On the other hand, from the perspective of the Few-Shot Learning (FSL) task, explicit modeling of cross-task general knowledge is required. In this paper, we propose a novel Bi-directional Task-Guided Network (BTG-Net) to tackle these issues. Specifically, from the FGIC task perspective, we design the Semantic-Guided Noise Filtering (SGNF) module to filter noise from mid-level features rich in detailed information. Further, from the FSL task perspective, the General Knowledge Prompt Modeling (GKPM) module is proposed to retain cross-task general knowledge via a prompting mechanism, thereby enhancing the model's generalization to novel classes. We conduct extensive experiments on five fine-grained benchmark datasets, and the results demonstrate that BTG-Net consistently outperforms state-of-the-art methods.
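The abstract does not specify BTG-Net's implementation, so the following is only a minimal conceptual sketch under stated assumptions: SGNF is approximated here as semantic-guided channel gating of mid-level feature maps, and GKPM as learnable prompt tokens shared across episodes and fused with episode features via cross-attention. All class names, dimensions, and hyperparameters (SGNFSketch, GKPMSketch, num_prompts, etc.) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only (PyTorch); module internals are assumptions, not BTG-Net's actual design.
import torch
import torch.nn as nn


class SGNFSketch(nn.Module):
    """Hypothetical semantic-guided noise filtering: a pooled high-level (semantic)
    vector gates mid-level feature maps channel-wise to suppress noisy responses."""

    def __init__(self, mid_channels: int, high_channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(high_channels, mid_channels), nn.Sigmoid())

    def forward(self, mid_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # mid_feat: (B, C_mid, H, W); high_feat: (B, C_high) pooled semantic vector
        g = self.gate(high_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C_mid, 1, 1)
        return mid_feat * g  # channel-wise filtering guided by semantics


class GKPMSketch(nn.Module):
    """Hypothetical general-knowledge prompt modeling: learnable prompt tokens are
    shared across tasks and injected into episode features via cross-attention."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, N, D) token features of support/query images in an episode
        prompts = self.prompts.unsqueeze(0).expand(feat.size(0), -1, -1)
        fused, _ = self.attn(query=feat, key=prompts, value=prompts)
        return feat + fused  # residual injection of cross-task general knowledge


if __name__ == "__main__":
    sgnf = SGNFSketch(mid_channels=256, high_channels=512)
    gkpm = GKPMSketch(dim=256)
    mid = torch.randn(2, 256, 14, 14)    # mid-level feature maps
    high = torch.randn(2, 512)           # pooled high-level semantic features
    filtered = sgnf(mid, high)                    # (2, 256, 14, 14)
    tokens = filtered.flatten(2).transpose(1, 2)  # (2, 196, 256)
    out = gkpm(tokens)                            # (2, 196, 256)
    print(filtered.shape, out.shape)
```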
Primary Subject Area: [Content] Media Interpretation
Relevance To Conference: This work is directly relevant to Media Interpretation and can be used for feature processing across multiple types of visual media.
We propose a novel Bi-directional Task-Guided Network (BTG-Net) to meet the feature requirements of both the fine-grained classification task and the few-shot learning task. Our method enhances visual media processing in two critical respects. First, the Semantic-Guided Noise Filtering (SGNF) module improves the extraction of fine-grained visual features. Second, the General Knowledge Prompt Modeling (GKPM) module retains cross-task general knowledge, enhancing the model's ability to interpret and generalize across visual media contexts. Our method achieves state-of-the-art results on five fine-grained benchmarks, verifying its effectiveness for visual media inference, and can in principle be applied to a variety of visual data modalities.
Supplementary Material: zip
Submission Number: 1170