Split and Merge: Aligning Position Biases in LLM-based Evaluators

ACL ARR 2024 June Submission1711 Authors

14 Jun 2024 (modified: 05 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or the second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies and calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, taking both length and semantics into account, and merges them back into a single prompt for evaluation by LLMs. Extensive experiments with six LLMs on 11,520 answer pairs demonstrate that PORTIA markedly enhances the consistency rates of all models and comparison forms tested, achieving an average relative improvement of 47.46%. It also enables GPT-3.5 to achieve performance comparable to GPT-4 and raises GPT-4's consistency rate to as high as 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass standalone GPT-4 in alignment with human evaluators, highlighting PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while maintaining cost efficiency.
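As a rough illustration of the split-and-merge idea summarized in the abstract, the sketch below splits each candidate answer into k length-based segments and interleaves the aligned segments into a single evaluation prompt. This is a minimal, hypothetical rendering only: the function names (split_answer, merge_prompt), the segment count, and the length-only splitting are assumptions for illustration; the paper's actual method also accounts for semantic boundaries.

```python
# Hypothetical sketch of the split-and-merge idea from the abstract.
# Names and the length-only split are illustrative, not the authors' implementation.

def split_answer(answer: str, k: int = 3) -> list[str]:
    """Split an answer into k roughly equal-length segments (length only; no semantics)."""
    step = max(1, len(answer) // k)
    segments = [answer[i * step:(i + 1) * step] for i in range(k - 1)]
    segments.append(answer[(k - 1) * step:])  # remainder goes to the last segment
    return segments

def merge_prompt(question: str, answer_a: str, answer_b: str, k: int = 3) -> str:
    """Interleave aligned segments of both answers into one evaluation prompt."""
    seg_a, seg_b = split_answer(answer_a, k), split_answer(answer_b, k)
    parts = [f"Question: {question}"]
    for i, (a, b) in enumerate(zip(seg_a, seg_b), start=1):
        parts.append(f"[Segment {i}] Answer A: {a}\n[Segment {i}] Answer B: {b}")
    parts.append("Which answer is better overall, A or B?")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(merge_prompt("What causes tides?",
                       "Tides result from gravitational pull ...",
                       "The moon's gravity deforms the oceans ..."))
```

Presenting aligned segments side by side, rather than the two full answers in sequence, is what lets the evaluator compare corresponding content directly and reduces the effect of answer order.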
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Large Language Models, Bias
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1711