Keywords: Large Language Model Evaluation, Foundation Model Evaluation, ELO Ranking
TL;DR: A suite of dynamic benchmarks for evaluating LLMs through head-to-head competition
Abstract: Evaluating the capabilities of Foundation Models has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations; these methods often suffer from overfitting, high costs, and biases. We introduce ZeroSumEval, a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. A key novelty is the integration of automatic prompt optimization, which ensures fair comparisons by eliminating biases from human prompt engineering and supports arbitrary prompting strategies. Furthermore, ZeroSumEval measures AI models' abilities to self-improve from limited observations and assesses their robustness against adversarial or misleading examples during prompt optimization. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for rigorous assessment. We find that ZeroSumEval correlates strongly with expensive human evaluations (Chatbot Arena) and disagrees with benchmarks that have known overfitting and saturation issues. Inspecting match traces reveals that models allocating more tokens to their reasoning process perform strongly in games that require planning.
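The keywords and TL;DR describe ranking models via Elo from head-to-head matches. As a rough illustration only (not the paper's actual implementation), a standard Elo update aggregated over pairwise game outcomes could look like the sketch below; the function names, K-factor of 32, and model identifiers are illustrative assumptions.

```python
# Minimal sketch of Elo-style rating updates from head-to-head match results.
# This is an assumed illustration, not the ZeroSumEval codebase.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings; score_a is 1.0 (win), 0.5 (draw), or 0.0 (loss)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: aggregate a list of (model_a, model_b, score_of_a) match outcomes.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
matches = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]
for a, b, score in matches:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score)
print(ratings)
```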
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2582