Abstract: Real-world data often follows a long-tailed distribution, with individual samples frequently belonging to multiple categories. Together, these properties make content comprehension difficult, particularly in long-tailed multi-label image classification (LTMLC), where imbalanced data distribution and multi-object recognition both pose significant hurdles. To address these challenges, we propose Category-Prompt Refined Feature Learning (CPRFL), a novel and effective approach for LTMLC that exploits semantic correlations between categories and decouples a category-specific visual representation for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism that refines the prompts by progressively incorporating context-related visual information into them. Simultaneously, the refinement process progressively purifies the category-specific visual representations under the guidance of the refined prompts. Furthermore, to account for the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance head-to-tail recognition performance. We validate our method on two LTMLC benchmarks, and extensive experiments demonstrate its superiority over baselines.
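The role of the Asymmetric Loss mentioned in the abstract can be illustrated with a minimal sketch of its commonly published form. The hyperparameter values (`gamma_pos`, `gamma_neg`, `margin`) and the function name are illustrative assumptions and are not taken from this submission; the sketch only shows how easy negatives are suppressed while positives are left largely untouched.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-class asymmetric loss terms for one image.

    probs   -- predicted probability per class, each in [0, 1]
    targets -- 1 if the class is present in the image, else 0
    The default hyperparameters are assumptions chosen for illustration.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style weighting controlled by gamma_pos.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin, so
            # negatives with p <= margin contribute exactly zero loss.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total
```

With `gamma_neg` larger than `gamma_pos`, the many easy negatives in each image are down-weighted more strongly than the few positives, which is the negative-positive imbalance the abstract refers to.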
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work addresses the Long-Tailed Multi-Label Image Classification (LTMLC) task, a complex challenge that has seen limited application of multimodal fusion strategies. In our approach, CPRFL leverages CLIP's text encoder to extract category semantics, using its powerful semantic representation capability to establish semantic correlations between head and tail classes. The extracted category semantics serve as category-prompts, which allow category-specific visual representations to be decoupled from each sample. Through visual-semantic information interaction, a form of multimodal fusion, CPRFL integrates CLIP's linguistic knowledge with the visual representations from images to discern the context-related visual information specific to each category. For multi-label inference, CPRFL computes the feature similarity between the category-specific visual representations and their corresponding prompts, a method commonly used to align multimodal features. In summary, CPRFL centers on multimodal information fusion between CLIP's linguistic knowledge and the visual representations from images to address the LTMLC task. This approach contributes to multimedia and multimodal processing by bridging the gap between the semantic and visual domains.
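The similarity-based inference step described above can be sketched as follows. This is a minimal illustration only: cosine similarity, the fixed threshold, and the names `cosine_similarity` and `predict_labels` are assumptions made for the example, since the text does not specify the exact similarity measure or decision rule.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_labels(category_features, prompts, threshold=0.5):
    """Score each category by comparing its visual feature to its prompt.

    category_features -- one category-specific visual vector per class
    prompts           -- one prompt vector per class, in the same order
    threshold         -- hypothetical decision threshold (an assumption)
    """
    scores = [cosine_similarity(f, p)
              for f, p in zip(category_features, prompts)]
    labels = [1 if s > threshold else 0 for s in scores]
    return scores, labels
```

For example, `predict_labels([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [1.0, 0.0]])` scores the first category highly, because its feature is aligned with its prompt, and rejects the second, whose feature is orthogonal to its prompt.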
Supplementary Material: zip
Submission Number: 2857