Abstract: With the rise of large language models, evaluating their outputs has become increasingly important. While supervised evaluation compares model responses to ground truths, dialogue models are often assessed with the Side-by-Side approach, in which a judge compares the responses of a baseline and a candidate model using a predefined methodology. In this paper, we conduct an in-depth analysis of the Side-by-Side approach for evaluating models in Russian and Arabic, as well as for code generation, and investigate the circumstances under which LLM evaluators can be considered an alternative to expert annotation. We propose and publicly release a methodology that improves the correlation between automatic evaluation and human annotation through careful prompt engineering and the addition of model reasoning. We demonstrate the problem of positional bias, propose metrics for measuring it, and suggest ways to mitigate it.
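To make the Side-by-Side protocol and the positional-bias measurement concrete, here is a minimal Python sketch (not the authors' released methodology): the judge is queried twice with the baseline and candidate answers in swapped positions, and positional bias is estimated as the share of verdicts that flip under the swap. The judge function is a hypothetical placeholder for whatever LLM call the evaluator uses.

```python
# Minimal sketch, assuming a hypothetical judge(prompt, answer_a, answer_b)
# callable that returns "A", "B", or "tie". Not the paper's released code.
from typing import Callable, List, Tuple

Verdict = str  # "A", "B", or "tie"

def swapped_verdicts(
    judge: Callable[[str, str, str], Verdict],
    prompt: str,
    baseline: str,
    candidate: str,
) -> Tuple[Verdict, Verdict]:
    """Ask the judge twice, swapping which answer occupies position A."""
    first = judge(prompt, baseline, candidate)   # baseline shown in slot A
    second = judge(prompt, candidate, baseline)  # candidate shown in slot A
    return first, second

def positional_flip_rate(pairs: List[Tuple[Verdict, Verdict]]) -> float:
    """Fraction of examples whose preferred model changes when answer order is swapped.

    Because the answers trade slots between the two calls, a consistent judge
    should pick opposite slot labels (e.g. "A" then "B") for the same model.
    """
    def consistent(first: Verdict, second: Verdict) -> bool:
        if first == "tie" or second == "tie":
            return first == second
        return first != second  # same underlying model wins in both orderings

    flips = sum(0 if consistent(f, s) else 1 for f, s in pairs)
    return flips / len(pairs) if pairs else 0.0

if __name__ == "__main__":
    # Toy verdict pairs: ("A", "B") means the same model won in both orderings.
    toy = [("A", "B"), ("A", "A"), ("tie", "tie"), ("B", "B")]
    print(f"positional flip rate: {positional_flip_rate(toy):.2f}")  # 0.50
```

A flip rate near zero indicates the judge is insensitive to answer order; higher values signal the positional bias the paper discusses and would motivate mitigations such as averaging verdicts over both orderings.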
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: large language model, evaluation, side-by-side
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Russian, Arabic
Submission Number: 6462