EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

ACL ARR 2025 May Submission6008 Authors

20 May 2025 (modified: 29 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDIVE (English Diversity), a benchmark that evaluates seven state-of-the-art (SOTA) large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English (SAE) datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared to SAE. EnDIVE thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.

Paper Type: Long

Research Area: Ethics, Bias, and Fairness

Research Area Keywords: data ethics; model bias/fairness evaluation; model bias/unfairness mitigation; ethical considerations in NLP applications; transparency; policy and governance; reflections and critiques

Languages Studied: English

Submission Number: 6008
