Keywords: LAION-5B, Demographic Bias, Intersectional Bias, Dataset Analysis, Fairness
TL;DR: The study finds that the LAION-5B dataset systematically overrepresents young, White, male faces and underrepresents minority groups, while reinforcing stereotypical links between demographics and emotions.
Abstract: Large-scale image-text datasets, such as LAION-5B, are foundational to modern AI systems, yet their vast scale and uncurated nature raise significant concerns about demographic and stereotypical biases. This study presents a comprehensive analysis of the demographic composition and representational, stereotypical, and intersectional biases in LAION-2B-en and LAION-2B-multi, the two main components of the LAION-5B dataset. Using state-of-the-art models---FairFace, DeepFace, and Emo-AffectNet---we analyze faces detected in the dataset to identify biases across age, gender, race, and expressed emotion. Our findings reveal substantial overrepresentation of young adults (20--39), White individuals, and males, alongside consistent underrepresentation of minority racial groups and of middle-aged and older women across both dataset components. We also observe stereotypical associations between demographic attributes and emotions, such as "Anger" being predominantly linked to males and "Happiness" to females, pointing to systemic imbalances in the data. The consistency of these patterns across two demographic models and both components of LAION-5B demonstrates that these biases are deeply embedded in one of the most widely used training datasets. Given the scale at which LAION-5B is used to train generative models, these demographic imbalances could shape the behavior and outputs of numerous downstream AI systems.
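The over- and underrepresentation findings in the abstract can be quantified with a standard representation-ratio metric: the share of faces predicted to belong to each demographic group, divided by that group's share under some reference distribution. This is a minimal illustrative sketch, not the paper's exact methodology; the counts and reference shares below are hypothetical placeholders, and the group labels follow FairFace-style race categories for illustration only.

```python
from collections import Counter

def representation_ratio(observed_counts, reference_shares):
    """Ratio of each group's observed share to its reference share.

    Values > 1 indicate overrepresentation in the dataset,
    values < 1 indicate underrepresentation.
    """
    total = sum(observed_counts.values())
    return {
        group: (observed_counts.get(group, 0) / total) / share
        for group, share in reference_shares.items()
    }

# Hypothetical face counts by predicted race in a dataset sample
counts = Counter({"White": 700, "Black": 80, "East Asian": 120, "Indian": 100})

# Hypothetical reference shares (e.g., a population baseline; for illustration)
reference = {"White": 0.30, "Black": 0.17, "East Asian": 0.24, "Indian": 0.22}

ratios = representation_ratio(counts, reference)
# With these made-up numbers, "White" comes out above 1 (overrepresented)
# while the other groups fall below 1 (underrepresented).
```

The same computation applies unchanged to age brackets, gender labels, or intersectional groups (e.g., gender x age), which is how imbalances such as the underrepresentation of middle-aged and older women can be surfaced.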
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18240