Translation Quality in Multilingual LLM Evaluation

ACL ARR 2025 July Submission 1472 Authors

29 Jul 2025 (modified: 17 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Machine-translated evaluation benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs). However, translation errors in such benchmarks remain underexplored, raising concerns about the reliability and comparability of multilingual evaluation. This study examines the types of translation errors that occur in benchmark translations and how they affect LLM performance. We analyze five widely used English benchmarks translated into 20 European languages, using a validated LLM-based method to identify span-level translation errors at scale. To assess the impact of these errors, we apply three complementary analyses: comparing model accuracy on corrected vs. erroneous translations, testing statistical associations between error types and model performance, and estimating how strongly these errors affect model outcomes. Across all three analyses, meaning-related errors (mistranslations) lead to lower model performance, while other accuracy errors and fluency issues show weaker and more variable effects. Our results motivate translation-aware evaluation practices and enable scalable detection and analysis of translation artifacts.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multilingual benchmarks, multilingual evaluation, automatic evaluation, domain adaptation, benchmarking, automatic creation and evaluation of language resources, evaluation methodologies
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Bulgarian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, English
Previous URL: https://openreview.net/forum?id=1Wa3tiU9qi
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
A2 Elaboration: As we analyze existing evaluation datasets, the introduction of additional risks is unlikely.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sec. 1 (Span-ACES, EU20 suite, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, FLORES-200, GEMBA[-ESA])
B2 Discuss The License For Artifacts: No
B2 Elaboration: As for Span-ACES and GEMBA (NC license), we use them in a non-commercial setting and will comply with the license terms.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: We did not discuss this explicitly, but the datasets are public and widely used. To our knowledge, the licenses (permissive or CC non-commercial) do not specify particular intended uses.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We used only existing public datasets without personal data and did not collect any additional personal data.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix A (GEJ, Span-ACES), Appendix B (EU20)
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A (GEJ, Span-ACES), Appendix B (EU20)
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Model parameters: Table 12 (Appendix B)
C2 Experimental Setup And Hyperparameters: N/A
C2 Elaboration: We understand this item to refer to training runs. Otherwise, these details can be included in the final version; we will, of course, mention any HPC infrastructure used for inference in the acknowledgements.
C3 Descriptive Statistics: Yes
C3 Elaboration: We report significance results both in the main paper (Table 5, Figure 3) and in the appendix (Tables 14-16).
C4 Parameters For Packages: No
C4 Elaboration: These can be included in the final version (LM-Eval-Harness, vLLM).
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used generative AI assistants for linguistic improvements and for informational purposes regarding existing methods; we did not copy and paste unchecked citations.
Author Submission Checklist: Yes
Submission Number: 1472