BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

ACL ARR 2025 May Submission7365 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Existing multilingual benchmarks focus primarily on language understanding tasks. There is a lack of benchmarks measuring the comprehensive, critical capabilities of large language models (LLMs) across diverse languages, including instruction following, reasoning, code generation, and long-context understanding. To bridge this gap, we develop BenchMAX, a multi-way multilingual benchmark to evaluate LLMs' general abilities across many languages. BenchMAX consists of high-quality data samples annotated by native annotators in 17 languages, covering 10 diverse tasks. Extensive experiments on BenchMAX reveal uneven utilization of core capabilities across languages, highlighting performance gaps that scaling model size alone does not resolve. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code will be publicly accessible.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, multilingual benchmarks, multilingual evaluation
Contribution Types: Data resources
Languages Studied: English, Arabic, Bengali, Chinese, Czech, French, German, Hungarian, Japanese, Korean, Russian, Serbian, Spanish, Swahili, Telugu, Thai, Vietnamese
Submission Number: 7365