Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: This paper addresses imbalance in pseudolabels generated by VLMs, identifying concept mismatch and concept confusion as key causes. It proposes a framework with concept alignment and confusion-aware margin to improve pseudolabel balance and accuracy.
Abstract: Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at https://github.com/Noahwangyuchen/CAP
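The abstract describes the mechanism only at a high level; the exact formulation of the confusion-aware calibrated margin is given in the paper and the linked repository. As a rough, hypothetical sketch of the general idea of applying class-dependent margins to VLM similarity logits so that under-predicted classes are boosted when assigning pseudolabels (all function names, hyperparameters, and the specific margin form below are assumptions, not the authors' implementation):

```python
import torch

def margin_adjusted_pseudolabels(logits, class_freq, tau=1.0, threshold=0.6):
    """Hypothetical illustration, not the CAP method itself.

    logits:      (N, C) image-text similarity scores from a VLM (e.g. CLIP).
    class_freq:  (C,) running frequency of each class among current pseudolabels.
    tau:         strength of the margin adjustment (assumed hyperparameter).
    threshold:   confidence threshold for keeping a pseudolabel.
    """
    # Subtract a larger margin from frequently predicted classes
    # (in the spirit of logit adjustment), so rarely predicted classes
    # become relatively more likely to be selected.
    margins = tau * torch.log(class_freq.clamp(min=1e-8))
    adjusted = logits - margins
    probs = adjusted.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    mask = conf >= threshold  # keep only confident pseudolabels
    return labels, mask
```

This is only meant to convey why class-dependent margins can rebalance pseudolabels; the paper's calibrated margin additionally accounts for which class pairs the model confuses.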
Lay Summary: Modern AI systems that connect images with language—called vision-language models (VLMs)—are being used to label new images without human effort. However, these automatic labels (called pseudolabels) are often unbalanced. That means the model favors some categories over others, which leads to poor performance when applied to real-world tasks. We explored why this imbalance happens and discovered two key issues: the model may fail to extract the precise meaning of certain categories (concept mismatch), or confuse similar-looking ones (concept confusion). To fix this, we designed a new framework called CAP, combining concept alignment to help the model better match text and images, and a confusion-aware calibrated margin to help the model better tell similar categories apart. Our approach leads to more accurate and fair labels across categories. We tested it on six widely-used datasets and three learning setups, showing that it consistently improves results—by over 6% compared to the best existing method.
Link To Code: https://github.com/Noahwangyuchen/CAP
Primary Area: General Machine Learning->Unsupervised and Semi-supervised Learning
Keywords: semi-supervised learning, unsupervised learning, vision-language models, prompt tuning
Submission Number: 1261