Bias in CLIP Encoders: A Study of Encoder Bias and Object Representation in Multi-Object Scenarios

Reza Abbasi; Ali Nazari; Aminreza Sefid; Mohammadali Banayeeanzade; Mohammad Hossein Rohban; Mahdieh Soleymani Baghshah

28 Sept 2024 (modified: 31 Oct 2024), ICLR 2025 Conference Desk Rejected Submission, CC BY 4.0

Keywords: CLIP, Multi-object, Vision-language Models, Bias

Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet handling complex multi-object scenarios remains challenging for them. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We present a specialized dataset, ComCO, crafted to thoroughly assess the performance of CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases in both encoders: the text encoder tends to prioritize objects that are mentioned first in the prompt, and the image encoder exhibits a bias toward larger objects. Through experiments on both retrieval-based and classification-based tasks, we quantify these biases across multiple CLIP variants. We hypothesize that these biases originate from CLIP's training process and provide substantiating evidence through detailed analyses of the LAION dataset and CLIP's training progression. Our image-text matching experiments demonstrate substantial performance drops when manipulating object sizes in the images and/or the order of object tokens in the prompt, highlighting CLIP's unstable performance when given rephrased yet semantically similar captions. We extend this analysis to longer, more complex captions and to text-to-image generative models such as Stable Diffusion, revealing how CLIP's text encoder bias influences object prominence in generated images based on the prompt's token order. This work provides insights into CLIP's behavior in complex visual-linguistic contexts, offering an evaluation methodology and identifying key areas for improving future vision-language models in multi-object scenarios.

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 12849
