Measuring Diversity in Datasets

ICLR 2024 Workshop DMLR Submission 33 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024, DMLR @ ICLR 2024, CC BY 4.0
Keywords: measurement theory, dataset collection, machine learning datasets
Abstract: Machine learning (ML) datasets, often perceived as "neutral," inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets. Despite their prevalence, these terms lack clear definitions and validation in datasets. Our research explores the implications of this issue, specifically analyzing "diversity" across 135 image and text datasets. Drawing from social sciences, we leverage principles from measurement theory to pinpoint considerations and offer recommendations on conceptualization, operationalization, and evaluation of diversity in ML datasets. Our recommendations extend to broader implications for ML research, advocating for a more nuanced and well-defined approach to handling value-laden properties in dataset construction.
Primary Subject Area: Other
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Submission Number: 33