Abstract: Deep learning models generally perform well across entire datasets but often exhibit disparate behavior across different subgroups. Such subgroup biases hinder real-world deployment. Although numerous efforts identify and mitigate subgroup biases using the powerful vision-language foundation model CLIP, these approaches commonly neglect the biases inherent in CLIP’s own feature encoding, which can limit performance gains. In this work, we introduce a novel strategy that employs an ensemble of surrogate models for adaptive and scalable discovery of biased subgroups, effectively reducing the impact of the feature-encoding biases inherent in CLIP. Additionally, we utilize a large vision-language model to elucidate the discovered subgroup biases and employ relative Fisher information to identify the layers critical for mitigating subgroup bias and suppressing shortcut learning. Extensive experiments on CIFAR-100, Breeds, and ICSD-171K demonstrate the effectiveness of the proposed methods. We also confirm the presence of subgroup bias by analyzing CLIP’s image encoder on the Hard ImageNet dataset.
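To make the relative Fisher information idea mentioned in the abstract concrete, below is a minimal, hedged sketch (not the authors' implementation): it assumes a diagonal Fisher approximation, averages squared log-likelihood gradients per layer, and scores each layer by the ratio of its Fisher mass on a biased subgroup to that on the full dataset; the helper names (`layerwise_fisher`, `relative_fisher`) and the per-layer sum-ratio scoring are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def layerwise_fisher(model, loader, device="cpu"):
    """Diagonal Fisher approximation: average squared gradients of the
    negative log-likelihood over a data loader, per parameter tensor."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    n_samples = 0
    model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = F.nll_loss(F.log_softmax(model(x), dim=-1), y)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 * x.size(0)
        n_samples += x.size(0)
    return {n: f / max(n_samples, 1) for n, f in fisher.items()}

def relative_fisher(model, subgroup_loader, full_loader, device="cpu", eps=1e-12):
    """Score each layer by the ratio of its Fisher information on a biased
    subgroup to that on the full dataset; higher scores flag layers that are
    candidates for targeted bias mitigation (an assumed selection rule)."""
    f_sub = layerwise_fisher(model, subgroup_loader, device)
    f_full = layerwise_fisher(model, full_loader, device)
    return {n: (f_sub[n].sum() / (f_full[n].sum() + eps)).item() for n in f_sub}
```

In such a sketch, layers with the largest relative scores would be the ones fine-tuned or regularized to suppress shortcut learning, while the remaining layers are left largely untouched.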
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yannis_Kalantidis2
Submission Number: 3925