AI Passes Humanity’s Last Exam and Generates Video Tutorials

Agents4Science 2025 Conference Submission 236 Authors

15 Sept 2025 (modified: 08 Oct 2025), submitted to Agents4Science, CC BY 4.0
Keywords: Foundation models and reasoning LLMs, LLM datasets and benchmarks, Math education
TL;DR: AI passes Humanity’s Last Exam using best-of-N rejection sampling and model routing, plus auto-generated math video tutorials
Abstract: We demonstrate that AI passes Humanity's Last Exam (HLE) by combining Best-of-$N$ rejection sampling with the use of appropriate models for different question categories. Specifically, we pass the HLE with 53\% accuracy, without online search, at a cost of around \$3 per question and a running time of less than 5 minutes per question, as verified by humans on a random sample of 100 questions. We compare the answers and performance of different models and methods and analyze their similarities and differences, identifying which pairs of models give the same wrong answers. To aid human understanding, we use AI to generate educational videos explaining the HLE math questions and their answers. An expert mathematician curates and analyzes a subset of the most challenging HLE math questions that AI has yet to solve, providing insights into current limitations.
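The two ingredients named in the abstract, per-category model routing and Best-of-$N$ rejection sampling, can be illustrated with a minimal sketch. This is not the authors' implementation. The names `route`, `best_of_n`, `Model`, and `Verifier` are hypothetical, and two details are assumptions made only for illustration: the verifier is a simple accept/reject function, and the final answer is chosen by majority vote among accepted candidates.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Hypothetical types: a "model" maps a question to one sampled answer,
# and a "verifier" accepts or rejects a candidate answer.
Model = Callable[[str], str]
Verifier = Callable[[str, str], bool]


def route(category: str, models: Dict[str, Model], default: str) -> Model:
    """Return the model registered for a question category, else the default model."""
    return models.get(category, models[default])


def best_of_n(question: str, model: Model, verifier: Verifier, n: int) -> Optional[str]:
    """Draw n candidate answers, reject those the verifier refuses, and
    return the most frequent accepted answer (None if every candidate is rejected).

    Majority vote over accepted candidates is an assumption of this sketch,
    not a selection rule stated in the abstract.
    """
    candidates = [model(question) for _ in range(n)]
    accepted = [a for a in candidates if verifier(question, a)]
    if not accepted:
        return None
    return Counter(accepted).most_common(1)[0][0]
```

In use, a question would first be sent through `route` with its category label and then through `best_of_n` with the selected model. Increasing `n` multiplies the number of model calls, which is the main lever on the cost and running time per question.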
Supplementary Material: zip
Submission Number: 236