Keywords: LLM-as-a-Judge, Evaluation, Pairwise Comparison, Multi-Agent, Multi-Criteria, Bradley–Terry, Crowd-BT, Analytic Hierarchy Process
Abstract: LLM-as-a-Judge, the use of LLMs to evaluate responses to open-ended questions, has grown rapidly in recent years as a scalable alternative to manual human evaluation such as crowdsourcing, which is often time-consuming and costly. However, the discrepancy between LLM-generated and human evaluations remains a critical problem. To bridge this gap, we propose Multi-Aspect Panels of LLM Evaluators (MAPLE), a framework that orchestrates evaluations across multiple criteria using multiple LLMs. MAPLE integrates the criterion-wise pairwise evaluations produced by the panel by jointly estimating the importance of each criterion and the reliability of each evaluator. In experiments with both open-source and closed-source models, MAPLE achieves closer alignment with human evaluations than baseline methods, highlighting the value of multi-agent, multi-criteria evaluation strategies.
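The keywords and abstract reference Bradley–Terry-style aggregation of pairwise judgments weighted by criterion importance. As a rough illustration of that building block only (not MAPLE itself, which additionally estimates per-judge reliability in the spirit of Crowd-BT and derives criterion weights via the Analytic Hierarchy Process), here is a minimal NumPy sketch: it pools per-criterion pairwise win counts with hypothetical criterion weights and fits Bradley–Terry strengths with the classic MM (Zermelo) updates. The function name `fit_bradley_terry`, the toy win counts, and the weights are all illustrative assumptions, not artifacts of the paper.

```python
import numpy as np

def fit_bradley_terry(wins, n_items, iters=200, tol=1e-8):
    """Fit Bradley-Terry strengths with the MM (Zermelo) algorithm.

    wins[i][j] = (possibly weighted) count of times item i beat item j.
    Assumes every item appears in at least one comparison and has at
    least one win (the standard BT identifiability condition).
    Returns strengths normalized to sum to 1; higher = stronger item.
    """
    w = np.asarray(wins, dtype=float)
    n = w + w.T                     # total comparisons per pair
    p = np.ones(n_items)            # uniform initial strengths
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
        denom = n / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        new_p = w.sum(axis=1) / denom.sum(axis=1)
        new_p /= new_p.sum()
        if np.abs(new_p - p).max() < tol:
            return new_p
        p = new_p
    return p

# Toy example: 3 candidate responses, pairwise win counts from several
# LLM judges, tallied separately under two hypothetical criteria.
helpfulness = np.array([[0, 5, 6], [1, 0, 4], [0, 2, 0]], dtype=float)
factuality  = np.array([[0, 3, 3], [1, 0, 2], [1, 2, 0]], dtype=float)

# Stand-ins for importance weights (MAPLE derives these via AHP).
weights = {"helpfulness": 0.6, "factuality": 0.4}
pooled = weights["helpfulness"] * helpfulness + weights["factuality"] * factuality

print(fit_bradley_terry(pooled, n_items=3))  # response 0 scores highest
```

In this simplification every judge is trusted equally; a Crowd-BT-style extension would instead downweight each comparison by the estimated reliability of the evaluator that produced it.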
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, metrics, benchmarking, reproducibility, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4007