Keywords: geographical profiling, dataset auditing
Abstract: Building on studies documenting gender and racial biases in vision-language models, recent works show that such models often fail to generate geographically representative images that accurately reflect different regions around the world. A common concern is that the data used to train these models is not representative, prompting the question: *which parts of the world do these training examples come from?* To answer this question, we develop a system, *GeoProfiler*, which geographically profiles multimodal datasets by mapping image-caption pairs to countries. Using location information from captions, *GeoProfiler* maps examples to countries with high precision ($0.86$). We then apply *GeoProfiler* to geographically profile the English captions of the LAION dataset for $10$ common entities (e.g., house, flag). We find that the geographical distributions of $8$ of these entities follow a power law. The United States, the United Kingdom, and India are the most represented countries, appearing in $53.7\%$ of samples. Problematically, African and South American countries are severely under-represented, accounting for only $2.0\%$ and $4.3\%$ of images, respectively. We also observe a high correlation between a country's GDP and its frequency in the dataset ($\rho=0.79$). Lastly, we analyze the diversity of images from individual countries and find that a larger number of images does not imply higher diversity.
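The abstract does not spell out how captions are mapped to countries or how the GDP correlation is computed. The sketch below illustrates one plausible pipeline under stated assumptions: a gazetteer lookup over caption text followed by a Spearman rank correlation against GDP. The `GAZETTEER`, `map_caption_to_country`, example captions, and GDP figures are all illustrative placeholders, not the authors' implementation or data.

```python
# A minimal sketch in the spirit of GeoProfiler: map captions to countries
# via location cues, then correlate country frequency with GDP.
# All cues, captions, and GDP values below are made-up illustrations.
from collections import Counter
from scipy.stats import spearmanr

# Hypothetical gazetteer: country -> lowercase location cues
# (country names, demonyms, major cities).
GAZETTEER = {
    "United States": ["united states", "usa", "american", "new york"],
    "United Kingdom": ["united kingdom", "british", "london"],
    "India": ["india", "indian", "mumbai", "delhi"],
}

def map_caption_to_country(caption: str) -> str | None:
    """Return the first country whose location cue appears in the caption."""
    text = caption.lower()
    for country, cues in GAZETTEER.items():
        if any(cue in text for cue in cues):
            return country
    return None  # caption carries no recognizable location information

captions = [
    "A brick house in London with a red door",
    "Indian flag flying over Mumbai",
    "Suburban house in upstate New York",
    "American flag in front of a Texas ranch house",
]
# Count captions per country, skipping captions with no location match.
counts = Counter(
    c for cap in captions if (c := map_caption_to_country(cap)) is not None
)

# Illustrative nominal GDP values (trillions of USD).
gdp = {"United States": 25.4, "United Kingdom": 3.1, "India": 3.4}
countries = sorted(counts)
rho, _ = spearmanr(
    [counts[c] for c in countries], [gdp[c] for c in countries]
)
print(counts, rho)
```

A real profiler would need to handle ambiguous place names, multiple location mentions per caption, and non-English demonyms; this sketch only shows the overall frequency-then-correlation shape of the analysis.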
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11045