Translation Quality in Multilingual LLM Evaluation

ACL ARR 2025 July Submission 1472 Authors

29 Jul 2025 (modified: 17 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Machine-translated evaluation benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs). However, translation errors in such benchmarks remain underexplored, raising concerns about the reliability and comparability of multilingual evaluation. This study examines the types of translation errors that occur in benchmark translations and how they affect LLM performance. We analyze five widely used English benchmarks translated into 20 European languages, using a validated LLM-based method to identify span-level translation errors at scale. To assess the impact of these errors, we apply three complementary analyses: comparing model accuracy on corrected vs. erroneous translations, testing statistical associations between error types and model performance, and estimating how strongly these errors affect model outcomes. Across all three analyses, meaning-related errors (mistranslations) lead to lower model performance, while other accuracy errors and fluency issues show weaker and more variable effects. Our results motivate translation-aware evaluation practices and enable scalable detection and analysis of translation artifacts.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multilingual benchmarks, multilingual evaluation, automatic evaluation, domain adaptation, benchmarking, automatic creation and evaluation of language resources, evaluation methodologies
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Bulgarian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, English
Previous URL: https://openreview.net/forum?id=1Wa3tiU9qi
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
A2 Elaboration: As we analyze existing evaluation datasets, the introduction of additional risks is unlikely.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sec. 1 (Span-ACES, EU20 suite, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, FLORES-200, GEMBA[-ESA])
B2 Discuss The License For Artifacts: No
B2 Elaboration: As for Span-ACES and GEMBA (NC license), we use them in a non-commercial setting and will comply with the license terms.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: We did not discuss this explicitly, but the datasets are public and widely used. To our knowledge, the licenses (permissive or CC non-commercial) do not specify particular intended uses.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We used only existing public datasets without personal data and did not collect any additional personal data.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix A (GEJ, Span-ACES), Appendix B (EU20)
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A (GEJ, Span-ACES), Appendix B (EU20)
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Model parameters: Table 12 (Appendix B)
C2 Experimental Setup And Hyperparameters: N/A
C2 Elaboration: We understand this item to refer to training runs. Otherwise, these details can be included in the final version; we will, of course, mention any HPC infrastructure used for inference in the acknowledgements.
C3 Descriptive Statistics: Yes
C3 Elaboration: We report significance results both in the main paper (Table 5, Figure 3) and in the appendix (Tables 14-16).
C4 Parameters For Packages: No
C4 Elaboration: These can be included in the final version (LM-Eval-Harness, vLLM).
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used generative AI assistants for linguistic improvements and for informational purposes regarding existing methods; we did not copy and paste unchecked citations.
Author Submission Checklist: Yes
Submission Number: 1472