A survey of diversity quantification in natural language processing: The why, what, where and how

Published: 27 May 2026, Last Modified: 27 May 2026, UniDive 2026, CC BY-SA 4.0
Keywords: linguistic diversity, quantification of diversity, diversity in NLP, multilingual
Working Group: WG4: Quantifying and promoting diversity
Abstract: Diversity has been gaining increasing attention in NLP in recent years. It has become an advocated property of datasets and systems in various NLP tasks and end-user applications, and many measures are used to quantify it. Nevertheless, there have been very few attempts to take a step back and understand the conceptualization of diversity in NLP and the motivations behind its endorsement. When such attempts were made, they were limited to particular areas (Tevet and Berant, 2021; Yang et al., 2025; Zhang et al., 2025) and diversity aspects (Lion-Bouton et al., 2022; Ploeger et al., 2024). Overall, NLP belongs to the "fields [...] where diversity is prominent in discussion, but remains undefined or analytically neglected" (Stirling, 2007, p. 707). The objective of this survey is to pave the way toward addressing these shortcomings by taking inspiration from studies outside of NLP where diversity has been systematically analyzed (Sarkar, 2010; Stirling, 1994). Our contributions are twofold: an NLP-specific framework for conceptualizing diversity quantification and recommendations for further conceptualization of diversity in the field. (camera-ready version attached as PDF)
WG4 Tasks: Task 4.2: Survey of diversity measures
Tracks For Type Of Contribution: Complete work (including previously published work)
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 26