No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Agent, LLM-as-a-Judge, Evaluation, Majority Voting, Recommendation System
Abstract: Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark uncovers several key takeaways: (i) Anthropic Claude-3.5-sonnet achieves the highest decision confidence, (ii) Gemini-1.5-pro offers the best overall performance across categories, (iii) GPT-4o provides the most favorable latency-accuracy-cost trade-off, and (iv) GPT-OSS-20B leads among open-source models. Category-level analysis further reveals strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). Together, these findings establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs-as-judges, offering both methodological advances and actionable insights into scaling, reliability, and model family trade-offs.
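To illustrate the consensus step described above, the sketch below shows a minimal majority-vote aggregator in Python. The function name, label strings, and the `min_agreement` threshold are assumptions chosen for illustration; they are not the paper's actual implementation.

```python
from collections import Counter

def majority_vote(judgments, min_agreement=0.5):
    """Aggregate independent judge labels for one item into a consensus label.

    judgments: labels (e.g. "complementary" / "not_complementary" or issue
    codes) emitted by independent LLM judges for one query-recommendation pair.
    min_agreement: fraction of votes the top label must reach to be accepted
    as a ground-truth label; below that, the item is flagged as disputed.
    """
    counts = Counter(judgments)
    top_label, top_votes = counts.most_common(1)[0]
    agreement = top_votes / len(judgments)
    if agreement >= min_agreement:
        return top_label, agreement
    return None, agreement  # no consensus; leave for further review

# Example: five hypothetical judge verdicts for one recommendation pair
labels = ["complementary", "complementary", "not_complementary",
          "complementary", "complementary"]
print(majority_vote(labels))  # -> ('complementary', 0.8)
```

In this sketch, returning `None` for low-agreement items is one way to surface the category-level disagreement the abstract reports (e.g., in Clothing and Food) rather than forcing a label.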
Submission Number: 110