No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Agent, LLM-as-a-Judge, Evaluation, Majority Voting, Recommendation System
Abstract: Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark uncovers several key takeaways: (i) Anthropic Claude-3.5-sonnet achieves the highest decision confidence, (ii) Gemini-1.5-pro offers the best overall performance across categories, (iii) GPT-4o provides the most favorable latency-accuracy-cost trade-off, and (iv) GPT-OSS-20B leads among open-source models. Category-level analysis further reveals strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). Together, these findings establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs-as-judges, offering both methodological advances and actionable insights into scaling, reliability, and model family trade-offs.
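To illustrate the consensus step described above, the sketch below shows a minimal majority-vote aggregator in Python. The function name, label strings, and the `min_agreement` threshold are assumptions chosen for illustration; they are not the paper's actual implementation.

```python
from collections import Counter

def majority_vote(judgments, min_agreement=0.5):
    """Aggregate independent judge labels for one item into a consensus label.

    judgments: labels (e.g. "complementary" / "not_complementary" or issue
    codes) emitted by independent LLM judges for one query-recommendation pair.
    min_agreement: fraction of votes the top label must reach to be accepted
    as a ground-truth label; below that, the item is flagged as disputed.
    """
    counts = Counter(judgments)
    top_label, top_votes = counts.most_common(1)[0]
    agreement = top_votes / len(judgments)
    if agreement >= min_agreement:
        return top_label, agreement
    return None, agreement  # no consensus; leave for further review

# Example: five hypothetical judge verdicts for one recommendation pair
labels = ["complementary", "complementary", "not_complementary",
          "complementary", "complementary"]
print(majority_vote(labels))  # -> ('complementary', 0.8)
```

In this sketch, returning `None` for low-agreement items is one way to surface the category-level disagreement the abstract reports (e.g., in Clothing and Food) rather than forcing a label.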
Submission Number: 110