Keywords: benchmark, evaluations, platform, LLMs, paradigm, verifiers, generator-validator gap
TL;DR: We explore a radical paradigm for AI evals: assessing LLMs on unsolved questions. Instead of contrived exams where progress ≠ value, we evaluate LLMs on organic, unsolved problems via reference-free LLM validation and community verification.
Abstract: Benchmarks shape progress in AI research.
A useful benchmark should be both *difficult* and *realistic*: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems.
In this work, we explore a radically different paradigm: assessing models on *unsolved* questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification.
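As a concrete illustration of this workflow, the minimal Python sketch below shows how candidate answers could be pre-screened by compound, reference-free validators before being surfaced for community verification. The names, data fields, and threshold are assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# Hypothetical sketch of validator-assisted screening; names, types, and the
# 0.5 threshold are illustrative assumptions, not the UQ implementation.

Validator = Callable[[str, str], float]  # (question, answer) -> confidence in [0, 1]

@dataclass
class Candidate:
    question: str
    model: str
    answer: str
    score: float = 0.0
    status: str = "pending"  # pending -> screened | rejected

def compound_validate(question: str, answer: str, validators: List[Validator]) -> float:
    """Combine several reference-free validator judgments (e.g., different
    prompts or models) into a single screening score by averaging."""
    return mean(v(question, answer) for v in validators)

def screen(candidates: List[Candidate], validators: List[Validator],
           threshold: float = 0.5) -> List[Candidate]:
    """Pre-screen candidate answers so human experts only review a small,
    high-signal subset; surviving candidates await community verification."""
    passed = []
    for cand in candidates:
        cand.score = compound_validate(cand.question, cand.answer, validators)
        cand.status = "screened" if cand.score >= threshold else "rejected"
        if cand.status == "screened":
            passed.append(cand)
    return passed
```

Because screening is reference-free, this loop can run asynchronously as new model answers arrive, rather than scoring a static benchmark once.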
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value.
Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions, enabling ongoing, asynchronous, and community-driven evaluation.
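To make the curation funnel in contribution (1) concrete, here is a minimal sketch assuming hypothetical field names, heuristics, and thresholds; the paper's actual filters and judge prompts are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Illustrative sketch of the curation funnel (rule-based filters -> LLM judges
# -> human review). Field names, heuristics, and thresholds are assumptions.

@dataclass
class RawQuestion:
    qid: str
    title: str
    body: str
    views: int
    answer_count: int

# An LLM judge rates quality criteria, e.g. {"well_defined": 0.9, "difficult": 0.8}.
LLMJudge = Callable[[RawQuestion], Dict[str, float]]

def passes_rules(q: RawQuestion) -> bool:
    """Cheap heuristics first: keep unanswered questions with enough
    engagement and substance to be worth judging."""
    return q.answer_count == 0 and q.views >= 1000 and len(q.body) >= 200

def curate(questions: Iterable[RawQuestion], judge: LLMJudge,
           min_score: float = 0.7) -> List[RawQuestion]:
    """Return the shortlist forwarded to human reviewers for final inclusion."""
    shortlist = []
    for q in questions:
        if not passes_rules(q):
            continue
        scores = judge(q)
        if scores and min(scores.values()) >= min_score:
            shortlist.append(q)
    return shortlist
```

The ordering is the usual funnel design: inexpensive rule-based filters prune the bulk of candidates before the costlier LLM judging and human review stages.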
The top-performing model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed.
UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge.
Submission Number: 74