Keywords: Vision-Language Models, fairness, social bias
Abstract: Vision-language models like CLIP are widely used for multimodal retrieval tasks. However, they can learn historical biases from their training data, resulting in the perpetuation of stereotypes and potential harm. In this study, we analyze the social biases present in CLIP, particularly in the interaction between image and text. We introduce a taxonomy of social biases called So-B-IT, consisting of 374 words categorized into ten types of bias. When disproportionately associated with specific demographic groups, these words can cause societal harm. Using this taxonomy, we examine the images CLIP retrieves from a facial image dataset when each word is supplied as a prompt. We observe that CLIP often exhibits undesirable associations between harmful words and particular demographic groups. Furthermore, we explore the source of these biases by demonstrating their presence in a large image-text dataset used to train CLIP models. Our findings emphasize the significance of evaluating and mitigating bias in vision-language models, underscoring the necessity for transparent and fair curation of extensive pre-training datasets.
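The sketch below illustrates the kind of prompt-based retrieval the abstract describes: encoding a set of face images with CLIP, encoding a text prompt built from a taxonomy word, and ranking images by cosine similarity. It is a minimal illustration assuming a Hugging Face CLIP checkpoint ("openai/clip-vit-base-patch32"), a hypothetical local folder of face images, and a hypothetical prompt template; it is not the authors' exact pipeline.

```python
# Minimal sketch of prompt-based image retrieval with CLIP.
# Assumptions (not from the paper): a Hugging Face checkpoint, a local
# "faces/" folder of images, and a simple "a photo of a {word} person" prompt.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode every face image once and L2-normalize the embeddings.
image_paths = sorted(Path("faces/").glob("*.jpg"))  # hypothetical dataset path
images = [Image.open(p).convert("RGB") for p in image_paths]
with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)


def retrieve(word: str, k: int = 10) -> list[Path]:
    """Return the k images most similar to a text prompt built from `word`."""
    prompt = f"a photo of a {word} person"  # hypothetical prompt template
    with torch.no_grad():
        text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Cosine similarity between the prompt and every image embedding.
    scores = (image_emb @ text_emb.T).squeeze(-1)
    top = scores.topk(min(k, len(image_paths))).indices.tolist()
    return [image_paths[i] for i in top]


# Example: inspect which faces CLIP retrieves for a harmful descriptor,
# then compare the demographic makeup of the retrieved set.
print(retrieve("criminal", k=10))
```

In an audit of this kind, the demographic composition of each retrieved set would then be compared against the composition of the full dataset to surface disproportionate associations.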
Submission Number: 37