Keywords: Large Language Model Evaluation, Cost-Aware Evaluation, Dual-Track Rating, Swiss-Style Pairing
TL;DR: PRISM-EE enables AI models to evaluate one another with cost awareness, revealing that some models deliver 97% of top performance at 641× lower cost, an efficiency gap invisible to traditional benchmarks.
Abstract: Large Language Model evaluation faces three critical problems: static benchmarks suffer from data contamination, human-judged systems have systematic biases, and, most importantly, both ignore cost, the key factor determining real-world deployment decisions. We introduce PRISM-EE (Peer-Reviewed Intelligence Scoring Methodology with Economic Evaluation), a peer-federated framework in which AI models evaluate each other through specialized roles: competitors solve problems, content creators design challenges, and judges evaluate solutions. This approach generates fresh content dynamically while reducing human bias. PRISM-EE evaluates models on dual tracks: raw performance and cost efficiency. Using Swiss-style pairing, we achieve stable ratings in 25-30 matches with ±18 Elo precision, compared to the 100+ matches required by existing systems. We tested 48 models across clinical reasoning, mathematics, and programming domains. Results reveal dramatic cost variations invisible to traditional benchmarks: substantial efficiency gaps between models with similar capabilities, with some models delivering 97% of top performance at just 0.16% of the cost. PRISM-EE achieves 89% judge agreement compared to 72% for human evaluators, and resists gaming through cross-provider validation and transparent logging. The framework includes a comprehensive governance system ensuring fair evaluation opportunity for all models regardless of provider size. Our open-source framework makes economic efficiency a primary evaluation criterion, enabling better deployment decisions where both performance and cost matter.
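The abstract describes the pairing and rating scheme only at a high level. The sketch below is our own illustration, not the authors' implementation: it shows how Swiss-style pairing could feed standard Elo updates under assumed parameters (the `Model` class, `K_FACTOR`, and the random stand-in for judge verdicts are all hypothetical; in PRISM-EE the outcomes would come from peer judges, and cost efficiency would be tracked separately).

```python
# Minimal sketch of Swiss-style pairing driving Elo updates.
# Illustrative only; names and constants are assumptions, not the paper's code.
from dataclasses import dataclass, field
import random

K_FACTOR = 32  # assumed Elo K-factor; the paper does not specify one


@dataclass
class Model:
    name: str
    rating: float = 1500.0               # raw-performance track
    opponents: set = field(default_factory=set)


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation of A beating B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(a: Model, b: Model, score_a: float) -> None:
    """Update both ratings after a judged match; score_a is 0, 0.5, or 1."""
    exp_a = expected_score(a.rating, b.rating)
    a.rating += K_FACTOR * (score_a - exp_a)
    b.rating += K_FACTOR * ((1.0 - score_a) - (1.0 - exp_a))


def swiss_pair(models: list) -> list:
    """Pair models with the closest current ratings that have not yet met."""
    pool = sorted(models, key=lambda m: m.rating, reverse=True)
    pairs = []
    while len(pool) >= 2:
        a = pool.pop(0)
        for i, b in enumerate(pool):      # nearest-rated unplayed opponent
            if b.name not in a.opponents:
                pool.pop(i)
                break
        else:
            b = pool.pop(0)               # fall back to a rematch
        a.opponents.add(b.name)
        b.opponents.add(a.name)
        pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    models = [Model(f"model_{i}") for i in range(8)]
    for _ in range(5):                    # a few Swiss rounds
        for a, b in swiss_pair(models):
            outcome = random.choice([0.0, 0.5, 1.0])  # stand-in for a judge verdict
            elo_update(a, b, outcome)
    for m in sorted(models, key=lambda m: m.rating, reverse=True):
        print(f"{m.name}: {m.rating:.0f}")
```

Because each round pairs models of similar strength, ratings separate with far fewer matches than random pairing would need, which is the intuition behind the 25-30 match convergence claim.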
Primary Area: datasets and benchmarks
Submission Number: 24199