Keywords: large language model, evaluation, side-by-side
TL;DR: We analyze the Side-by-Side model evaluation approach for different languages, proposing a methodology to align automatic and human evaluations, while addressing positional bias with new metrics and mitigation strategies.
Submission Type: Archival
Abstract: With the rise of large language models, evaluating their outputs has become increasingly important. While supervised evaluation compares model responses to ground truths, dialogue models are often assessed with the Side-by-Side approach, in which a judge compares the responses of a baseline model and a candidate model following a predefined methodology. In this paper, we conduct an in-depth analysis of the Side-by-Side approach for evaluating models on text generation as well as code generation tasks, and investigate the circumstances under which LLM evaluators can be considered an alternative to expert annotation. We propose and publicly release a methodology that enhances the correlation between automatic evaluation and human annotation through careful prompt engineering and the addition of model reasoning. We demonstrate the problem of positional bias, propose metrics for measuring it, and describe ways to mitigate it.
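To make the positional-bias idea concrete, here is a minimal illustrative sketch (not the paper's released methodology): each baseline/candidate pair is judged twice with the presentation order swapped, and two simple statistics are computed. The `judge` callable, the metric names, and the toy data are assumptions introduced for demonstration only.

```python
"""Illustrative sketch: measuring positional bias in Side-by-Side evaluation
by judging each pair twice with the order of the responses swapped.
All names below are hypothetical, not the paper's released code."""

from typing import Callable, List, Tuple

# A verdict is "A" (first-shown response wins), "B" (second-shown wins), or "tie".
Verdict = str


def positional_bias_metrics(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], Verdict],
) -> dict:
    """Judge each (baseline, candidate) pair in both orders and summarize.

    Returns:
      consistency: fraction of pairs where the judge picks the same underlying
                   response regardless of presentation order.
      first_position_preference: fraction of all non-tie verdicts that favor
                   whichever response was shown first (~0.5 means no positional bias).
    """
    consistent = 0
    first_wins = 0
    decisive = 0

    for baseline, candidate in pairs:
        v1 = judge(baseline, candidate)   # baseline shown first
        v2 = judge(candidate, baseline)   # candidate shown first

        # Map positional verdicts back to the underlying responses.
        winner1 = {"A": "baseline", "B": "candidate"}.get(v1, "tie")
        winner2 = {"A": "candidate", "B": "baseline"}.get(v2, "tie")

        if winner1 == winner2:
            consistent += 1

        for v in (v1, v2):
            if v in ("A", "B"):
                decisive += 1
                if v == "A":
                    first_wins += 1

    n = len(pairs)
    return {
        "consistency": consistent / n if n else 0.0,
        "first_position_preference": first_wins / decisive if decisive else 0.0,
    }


if __name__ == "__main__":
    # Toy judge that always prefers the first-shown response: maximal positional bias.
    always_first = lambda a, b: "A"
    demo_pairs = [("baseline answer", "candidate answer")] * 10
    print(positional_bias_metrics(demo_pairs, always_first))
    # -> {'consistency': 0.0, 'first_position_preference': 1.0}
```

One common mitigation consistent with this setup is to always evaluate both orderings and aggregate the two verdicts (e.g., count a win only when it holds under both presentations); the paper's actual metrics and mitigation strategies are defined in the released methodology.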
Submission Number: 29