LLM-WikiRace: A Benchmark for Planning and Reasoning over Real-World Knowledge Graphs

Published: 26 May 2026, Last Modified: 26 May 2026
Venue: ICML 2026 FoGen Workshop Poster
License: CC BY 4.0
Keywords: LLMs, Planning and Reasoning, Knowledge Graphs
TL;DR: LLMs play the Wikipedia game, reasoning over a large knowledge graph of concepts.
Abstract: We introduce LLM-WikiRace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-WikiRace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, which requires look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3.1, GPT-5, and Claude Opus 4.6, which achieve the strongest results on the easy split of LLM-WikiRace. Performance drops sharply on the hard split: the best-performing model, Gemini-3.1, succeeds in only 29% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-WikiRace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove.
Submission Number: 116
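
Illustrative sketch: the task described in the abstract (step-by-step link navigation from a source page to a target page, with loops as a failure mode) can be pictured with a toy example. Everything here is an assumption made for illustration and not the authors' evaluation harness: the tiny `LINKS` graph, the function names `shortest_path_length`, `play`, `first_link`, and `last_link`, the `max_steps` limit, and the use of revisit counts as a loop signal are all hypothetical. The `choose` callback stands in for whatever a model would pick at each step.

```python
from collections import deque

# Hypothetical toy link graph; the benchmark itself uses Wikipedia hyperlinks.
LINKS = {
    "Cat": ["Mammal", "Pet"],
    "Mammal": ["Animal", "Cat"],
    "Pet": ["Cat", "Dog"],
    "Dog": ["Mammal", "Pet"],
    "Animal": ["Biology"],
    "Biology": [],
}

def shortest_path_length(links, source, target):
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def play(links, source, target, choose, max_steps=10):
    """Run one game; choose(page, options, target) stands in for the model's move."""
    path = [source]
    while path[-1] != target and len(path) - 1 < max_steps:
        options = links.get(path[-1], [])
        if not options:  # dead end: no outgoing links
            break
        path.append(choose(path[-1], options, target))
    return {
        "success": path[-1] == target,
        "steps": len(path) - 1,
        # Pages visited more than once; a crude signal of looping.
        "revisits": len(path) - len(set(path)),
        "path": path,
    }

def first_link(page, options, target):
    """Trivial policy: always follow the first link on the page."""
    return options[0]

def last_link(page, options, target):
    """Trivial policy: always follow the last link on the page."""
    return options[-1]

optimal = shortest_path_length(LINKS, "Cat", "Biology")
good = play(LINKS, "Cat", "Biology", first_link)
stuck = play(LINKS, "Cat", "Biology", last_link, max_steps=6)
```

On this toy graph, `first_link` happens to follow a shortest route, while `last_link` bounces between "Pet" and "Dog" until the step limit is reached, which is the kind of non-recovering loop the abstract mentions. Comparing `steps` against `optimal` gives one possible notion of efficiency; the benchmark's actual metrics may differ.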