Abstract: Modeling label correlations has long played a pivotal role in multi-label image classification (MLC) and has attracted significant attention from researchers. However, recent studies overemphasize co-occurrence relationships among labels, which risks overfitting and yields suboptimal models. To tackle this problem, we advocate balancing correlative and discriminative relationships among labels to mitigate overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple subsets according to label co-occurrence and mutual exclusivity relationships, and then models the subsets separately to balance these relationships. Because each group contains multiple classes, we adopt multiple prompt tokens within a Vision Transformer (ViT) to capture the correlative or discriminative label relationships within each group and to effectively learn correlative or discriminative representations for the class subsets. Furthermore, each group yields multiple group-aware visual representations that may correspond to multiple classes; a mixture-of-experts (MoE) model adaptively maps these group-aware representations to label-aware ones, which are more conducive to classification. Experiments on multiple benchmark datasets show that the proposed approach achieves competitive results and outperforms state-of-the-art (SOTA) methods across multiple pre-trained models.
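The two ingredients named in the abstract, label statistics for grouping and a gated mixture that turns group-aware representations into a label-aware one, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cooccurrence_probs` and `label_aware_representation` are hypothetical, the conditional-probability statistic is only one plausible way to measure co-occurrence and mutual exclusivity, and the softmax gate is assumed as a common MoE form rather than taken from the paper.

```python
import math


def cooccurrence_probs(labels):
    """Estimate P(label j present | label i present) from binary label rows.

    Values near 1 suggest strong co-occurrence; values near 0 suggest the
    two labels are (empirically) mutually exclusive. Hypothetical helper.
    """
    n_classes = len(labels[0])
    counts = [[0] * n_classes for _ in range(n_classes)]
    for row in labels:
        for i in range(n_classes):
            if row[i]:
                for j in range(n_classes):
                    if row[j]:
                        counts[i][j] += 1
    probs = []
    for i in range(n_classes):
        total = counts[i][i]  # number of samples that contain label i
        probs.append(
            [counts[i][j] / total if total else 0.0 for j in range(n_classes)]
        )
    return probs


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def label_aware_representation(group_tokens, gate_logits):
    """Softmax-gated weighted sum of group-aware tokens for a single label.

    group_tokens: list of equal-length feature vectors (one per prompt token).
    gate_logits:  one score per token (assumed to come from a learned gate).
    """
    weights = softmax(gate_logits)
    dim = len(group_tokens[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, group_tokens))
        for d in range(dim)
    ]


# Toy data: 4 images, 3 classes (columns), 1 = label present.
toy_labels = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
toy_probs = cooccurrence_probs(toy_labels)

# Two group-aware tokens with equal gate scores give their plain average.
toy_repr = label_aware_representation([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy data, classes 0 and 1 frequently appear together, while class 2 never appears with either, so a grouping step could place the first pair in a co-occurrence subset and treat class 2 as mutually exclusive with them. How the framework actually forms its subsets and parameterizes its gate is not specified in the abstract.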