EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

Published: 05 Mar 2025, Last Modified: 06 Mar 2025 (CC BY 4.0)
Track: Long Paper Track (up to 9 pages)
Keywords: LLMs, NLU, dialectal variations, dialectal bias, LLM bias, AAVE, SAE, VALUE benchmark, MultiVALUE benchmark, GLUE tasks, SuperGLUE tasks, GPT-4o, DeepSeek v3, biases, inclusive NLP, ReDIAL benchmark
TL;DR: EnDive evaluates five LLMs across 12 tasks by translating SAE datasets into five underrepresented dialects, revealing significant performance disparities and advancing dialect-aware NLP to promote equitable language technology.
Abstract: The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates five widely-used large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities—models consistently underperform on dialectal inputs compared to Standard American English. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
Submission Number: 116
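
To make the filtering step mentioned in the abstract concrete, here is a minimal sketch of how near-identical translations could be screened out with a semantic-similarity cutoff. This is illustrative only, not the authors' pipeline: the helper name `filter_near_identical`, the `all-MiniLM-L6-v2` encoder, and the 0.98 threshold are assumptions; EnDive's actual filtering criteria and metrics may differ.

```python
# Hypothetical sketch of filtering (SAE, dialect) pairs whose translation barely
# changed the text. Encoder choice and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

SIMILARITY_THRESHOLD = 0.98  # assumed cutoff, not a value reported by the paper

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do


def filter_near_identical(pairs, threshold=SIMILARITY_THRESHOLD):
    """Keep only pairs where the dialect translation meaningfully diverges from SAE."""
    kept = []
    for sae_text, dialect_text in pairs:
        emb = model.encode([sae_text, dialect_text], convert_to_tensor=True)
        similarity = util.cos_sim(emb[0], emb[1]).item()
        if similarity < threshold:  # drop near-identical translations
            kept.append((sae_text, dialect_text))
    return kept


# Example usage: the unchanged pair is filtered out, the rewritten one is kept.
pairs = [
    ("She is going to the store.", "She finna go to the store."),
    ("The answer is four.", "The answer is four."),
]
print(filter_near_identical(pairs))
```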