Abstract: The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook variation within a language, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDiv (English Diversity), a benchmark that evaluates five widely used large language models (LLMs) on tasks spanning language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods through fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02 out of 7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared with Standard American English. EnDiv advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: data ethics, model bias/fairness evaluation, model bias/unfairness mitigation, ethical considerations in NLP applications, transparency, policy and governance, reflections and critiques
Languages Studied: English
Submission Number: 7126