The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

ACL ARR 2025 July Submission 40 Authors

18 Jul 2025 (modified: 19 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: As large language models (LLMs) increasingly bridge linguistic boundaries, robust multilingual evaluation has become critical to ensuring equitable technological development. This position paper analyzes over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to assess past, present, and future practices in multilingual benchmarking. Historically, we observe that substantial resources have been invested in creating multilingual benchmarks, yet English remains the most dominant language, and many benchmarks fail to address the needs of underrepresented languages and domains. Currently, we identify the most common use cases of LLMs and observe that benchmarks often lack real-world applicability and fail to align with human judgments, highlighting a disconnect between evaluation frameworks and practical utility. To address these gaps, we propose six essential characteristics of effective benchmarks: accuracy, resistance to contamination, appropriate challenge level, practical relevance, linguistic diversity, and cultural authenticity. We also outline five critical research directions, including improving natural language generation evaluation, expanding low-resource language coverage, and developing culturally authentic benchmarks. Our findings highlight a concerning trend: despite significant investment, many multilingual benchmarks neither reflect real-world applications nor align with human judgments. Through this structured analysis, we advocate for evaluation frameworks that better represent global linguistic diversity, ensuring that language technologies serve all communities equitably.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, evaluation, LLMs
Contribution Types: Data analysis, Position papers, Surveys
Languages Studied: English, Chinese, Russian, German, Spanish, French
Submission Number: 40