Abstract: Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social \emph{biases} from their training data. We systematically disentangle three design factors (\emph{model size}, \emph{training-data scale}, and \emph{training-data source}) by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image--text corpora on which they are pre-trained (400M proprietary pairs vs.\ 400M/2B LAION pairs). Across balanced face-analysis benchmarks, enlarging the encoder \emph{reduces} gender skew in CLIP but \emph{amplifies} both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring \emph{data source} as the primary driver of bias patterns.
We also evaluate three post-hoc, test-time debiasing strategies: \emph{Bias Prompts}, \emph{Prompt Array}, and \emph{SANER}. Debiasing \emph{reduces} but does not eliminate harm, and its effectiveness is \emph{source- and size-dependent}: Bias Prompts most effectively reduce gender skew in CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably reduce racial skew in OpenCLIP; scaling LAION changes which method is fairest. Taken together, these findings challenge the assumption that bigger models or datasets are automatically fairer and foreground training-data source as the key determinant of both bias and mitigation efficacy. We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs.
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/zahraaalsahili/CLIP_Bias
Assigned Action Editor: ~Kamalika_Chaudhuri1
Submission Number: 5126