Measuring Text-Image Retrieval Fairness with Synthetic Data

Lluis Gomez

Published: 12 Jul 2025, Last Modified: 12 Nov 2025 | ACM SIGIR Conference on Research and Development in Information Retrieval | WM2024 Conference

Abstract: In this paper, we study social bias in cross-modal text-image retrieval systems, focusing on the interaction between textual queries and image responses. Despite significant advances in cross-modal retrieval models, the potential for social bias in their responses remains a pressing concern and calls for a comprehensive framework for assessment and mitigation. We introduce a novel framework for evaluating social bias in cross-modal retrieval systems, built on a new dataset and metrics designed specifically for this purpose. Our dataset, Social Inclusive Synthetic Professionals Images (SISPI), comprises 49K images generated using state-of-the-art text-to-image models, ensuring a balanced representation of demographic groups across various professional roles. We use this dataset to conduct an extensive analysis of social bias (gender and ethnic) in state-of-the-art deep cross-modal retrieval models, including CLIP, ALIGN, BLIP, FLAVA, COCA, and many others. Using diversity metrics grounded in the distribution of different demographic groups' images in the retrieval rankings, we provide a quantitative measure of fairness that facilitates a detailed analysis of the models' behavior. Our work sheds light on biases present in current cross-modal retrieval systems and emphasizes the importance of training data curation, providing a foundation for future research and development towards more equitable and unbiased models. The dataset and code of our framework are publicly available at https://sispi-benchmark.github.io/sispi-benchmark/.
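
As an illustration of what a diversity metric over retrieval rankings can look like, the sketch below computes the normalized Shannon entropy of demographic group labels among the top-k retrieved images. This is an assumption made for illustration only: the abstract does not specify the exact metrics used, and the function names (`topk_group_distribution`, `normalized_entropy`) and the example labels are hypothetical, not part of the SISPI code.

```python
import math
from collections import Counter


def topk_group_distribution(ranked_groups, k):
    """Return the share of each demographic group among the top-k results.

    ranked_groups: group labels of retrieved images, best match first.
    (Hypothetical helper; not taken from the SISPI code base.)
    """
    top = ranked_groups[:k]
    if not top:
        raise ValueError("ranking is empty or k is not positive")
    counts = Counter(top)
    return {group: count / len(top) for group, count in counts.items()}


def normalized_entropy(ranked_groups, k, num_groups):
    """Shannon entropy of the top-k group distribution, scaled to [0, 1].

    1 means all num_groups groups are equally represented in the top k;
    0 means a single group occupies every top-k position.
    """
    if num_groups < 2:
        raise ValueError("num_groups must be at least 2")
    dist = topk_group_distribution(ranked_groups, k)
    entropy = -sum(p * math.log(p) for p in dist.values())
    # max() guards against a floating-point negative zero.
    return max(0.0, entropy / math.log(num_groups))


if __name__ == "__main__":
    # Hypothetical rankings for a single query, labeled by group.
    balanced = ["female", "male", "female", "male"]
    skewed = ["male", "male", "male", "male"]
    print(normalized_entropy(balanced, k=4, num_groups=2))
    print(normalized_entropy(skewed, k=4, num_groups=2))
```

Under this sketch, a ranking whose top k is evenly split across groups scores 1, and a ranking filled by a single group scores 0. Averaging the score over many queries (for example, one per professional role) would give a per-model summary, though the paper's own metrics may be defined differently.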