Abstract: With the rise of large language models, evaluating their outputs has become increasingly important. While supervised evaluation compares model responses to ground truths, dialogue models are often assessed with the Side-by-Side approach, in which a judge compares the responses of a baseline and a candidate model using a predefined methodology. In this paper, we conduct an in-depth analysis of the Side-by-Side approach for evaluating models in Russian and Arabic, as well as for code generation, and investigate the circumstances under which LLM evaluators can be considered an alternative to expert annotation. We propose and publicly release a methodology that improves the correlation between automatic evaluation and human annotation through careful prompt engineering and the addition of model reasoning. We demonstrate the problem of positional bias, propose metrics for measuring it, and suggest ways to mitigate it.
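To make the Side-by-Side protocol and the positional-bias measurement concrete, here is a minimal Python sketch (not the authors' released methodology): the judge is queried twice with the baseline and candidate answers in swapped positions, and positional bias is estimated as the share of verdicts that flip under the swap. The judge function is a hypothetical placeholder for whatever LLM call the evaluator uses.

```python
# Minimal sketch, assuming a hypothetical judge(prompt, answer_a, answer_b)
# callable that returns "A", "B", or "tie". Not the paper's released code.
from typing import Callable, List, Tuple

Verdict = str  # "A", "B", or "tie"

def swapped_verdicts(
    judge: Callable[[str, str, str], Verdict],
    prompt: str,
    baseline: str,
    candidate: str,
) -> Tuple[Verdict, Verdict]:
    """Ask the judge twice, swapping which answer occupies position A."""
    first = judge(prompt, baseline, candidate)   # baseline shown in slot A
    second = judge(prompt, candidate, baseline)  # candidate shown in slot A
    return first, second

def positional_flip_rate(pairs: List[Tuple[Verdict, Verdict]]) -> float:
    """Fraction of examples whose preferred model changes when answer order is swapped.

    Because the answers trade slots between the two calls, a consistent judge
    should pick opposite slot labels (e.g. "A" then "B") for the same model.
    """
    def consistent(first: Verdict, second: Verdict) -> bool:
        if first == "tie" or second == "tie":
            return first == second
        return first != second  # same underlying model wins in both orderings

    flips = sum(0 if consistent(f, s) else 1 for f, s in pairs)
    return flips / len(pairs) if pairs else 0.0

if __name__ == "__main__":
    # Toy verdict pairs: ("A", "B") means the same model won in both orderings.
    toy = [("A", "B"), ("A", "A"), ("tie", "tie"), ("B", "B")]
    print(f"positional flip rate: {positional_flip_rate(toy):.2f}")  # 0.50
```

A flip rate near zero indicates the judge is insensitive to answer order; higher values signal the positional bias the paper discusses and would motivate mitigations such as averaging verdicts over both orderings.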
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: large language model, evaluation, side-by-side
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Russian, Arabic
Submission Number: 6462