Keywords: Recommendation Explanation, Natural Language Evaluation, Multi-Agent Evaluation, Large Language Models
Abstract: Evaluating natural language explanations in recommender systems is essential for fostering user trust, transparency, and engagement. However, human evaluation, while accurate, is resource-intensive and impractical at the scale required by modern recommendation platforms, and automated methods built on single-agent LLMs suffer from prompt sensitivity and inconsistent outputs. To address these challenges, we propose MAREval, a structured multi-agent framework for evaluating recommendation explanations with large language models. MAREval orchestrates (i) a planner agent that uses a novel Chain of Debate (CoD) prompting strategy to coordinate agent roles and enforce logically consistent evaluation plans; (ii) a moderator agent that regulates discussions and mitigates prompt drift; (iii) an arbitrator agent that aggregates outputs from multiple evaluation rounds; and (iv) a Monte Carlo sampling procedure that improves robustness and alignment with human judgment. Comprehensive experiments on a public benchmark (TopicalChat) and a proprietary e-commerce recommendation dataset show that MAREval outperforms strong single- and multi-agent state-of-the-art baselines in alignment with human judgments. Stability analyses indicate substantially lower variability across repeated trials. In a large-scale human-annotation gate, MAREval meets production quality thresholds where prior evaluators fall short, and online A/B testing demonstrates statistically significant improvements in engagement and revenue metrics. These results establish MAREval as a scalable and reliable solution for human-aligned evaluation of recommendation explanations in real-world systems.
Submission Number: 76
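The pipeline described in the abstract (planner-driven debate, moderation, arbitration, and Monte Carlo aggregation over repeated rounds) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the role prompts, the stubbed `call_llm` scorer, the three-debater setup, and the number of rounds are all assumptions introduced here for illustration.

```python
import random
import statistics

# Hypothetical role prompts; the actual MAREval prompts are not given in the abstract.
ROLE_PROMPTS = {
    "planner": "Draft a Chain-of-Debate evaluation plan for the explanation.",
    "moderator": "Keep the debate on the plan; flag turns that drift from the rubric.",
    "arbitrator": "Aggregate the debaters' scores into a final judgment.",
}


def call_llm(role: str, prompt: str) -> float:
    """Placeholder for an LLM call that returns a 1-5 quality score.

    In a real system this would call a model client; here it simulates a
    noisy score so the sketch runs end to end.
    """
    return min(5.0, max(1.0, random.gauss(3.5, 0.6)))


def run_debate_round(explanation: str, seed: int) -> float:
    """One evaluation round: the planner sets the plan, debaters score the
    explanation, the moderator would prune off-plan turns (omitted in this
    stub), and the arbitrator collapses the round into one score."""
    random.seed(seed)
    plan = ROLE_PROMPTS["planner"] + "\nExplanation: " + explanation
    debater_scores = [call_llm("debater", plan) for _ in range(3)]
    return statistics.mean(debater_scores)  # arbitrator verdict for this round


def mareval_score(explanation: str, n_rounds: int = 8) -> float:
    """Monte Carlo aggregation: repeat the debate under different seeds and
    average the per-round verdicts, the stabilising idea named in the abstract."""
    round_scores = [run_debate_round(explanation, seed=s) for s in range(n_rounds)]
    return statistics.mean(round_scores)


if __name__ == "__main__":
    print(mareval_score("We recommend this jacket because you bought hiking boots."))
```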