Constrained Wikigame: Benchmarking Deductive Reasoning for Multi-Step Planning

Published: 05 Mar 2026, Last Modified: 25 Apr 2026
Venue: ICLR 2026 Workshop LLM Reasoning
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: reasoning, large language models, benchmark
TL;DR: Constrained Wikigame, a Wikipedia-link navigation benchmark where an LLM must reach a target page while never stepping through intermediate pages from a banned category
Abstract: Benchmarking LLMs on multi-step planning tasks typically relies on final-answer accuracy, so the evaluation fails to distinguish correct reasoning from lucky outcomes. We introduce Constrained Wikigame, a benchmark that extends the classic Wikigame (navigating Wikipedia from a source to a target article via hyperlinks) by introducing category constraints. This addition transforms a task where memorization and shortest-path heuristics may drive success into a step-level deduction task, as each decision involves explicitly justifying consistency with the constraint. We benchmark a suite of frontier reasoning and thinking models using both outcome-level metrics (success rate, constraint violation, and path efficiency) and reasoning validity, directly testing whether extended reasoning translates into reliable constrained planning.
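To make the task concrete, the following is a minimal sketch of how the outcome-level metrics named in the abstract could be computed for one navigation path. Everything in it is an assumption for illustration: the toy link graph, the category labels, the helper names (`is_banned`, `shortest_constrained_hops`, `evaluate_path`), and in particular the definition of path efficiency as the ratio of the shortest constraint-respecting hop count to the hops actually taken. The submission may define these metrics differently, and reasoning validity is not covered here.

```python
from collections import deque

# Hypothetical toy link graph and category labels (not taken from the benchmark).
LINKS = {
    "Coffee": ["Brazil", "Caffeine"],
    "Brazil": ["Football", "Amazon River"],
    "Caffeine": ["Chemistry"],
    "Chemistry": ["Nobel Prize"],
    "Football": ["Nobel Prize"],
    "Amazon River": [],
    "Nobel Prize": [],
}
CATEGORIES = {
    "Brazil": {"Countries"},
    "Football": {"Sports"},
    "Caffeine": {"Chemical compounds"},
    "Chemistry": {"Sciences"},
}


def is_banned(page, banned):
    """True if the page carries the banned category label."""
    return banned in CATEGORIES.get(page, set())


def shortest_constrained_hops(source, target, banned):
    """BFS hop count from source to target that skips banned intermediate pages.

    Returns None when no constraint-respecting path exists.
    """
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, hops = queue.popleft()
        if page == target:
            return hops
        for nxt in LINKS.get(page, []):
            if nxt in seen:
                continue
            # The constraint applies to intermediate pages only.
            if nxt != target and is_banned(nxt, banned):
                continue
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None


def evaluate_path(path, source, target, banned):
    """Score one path: success, violating pages, and an assumed efficiency ratio."""
    hops_valid = all(b in LINKS.get(a, []) for a, b in zip(path, path[1:]))
    reached = bool(path) and path[0] == source and path[-1] == target and hops_valid
    violations = [p for p in path[1:-1] if is_banned(p, banned)]
    success = reached and not violations
    best = shortest_constrained_hops(source, target, banned)
    # Assumed definition: optimal constrained hops divided by hops taken.
    efficiency = best / (len(path) - 1) if success and best and len(path) > 1 else 0.0
    return {"success": success, "violations": violations, "efficiency": efficiency}
```

For example, with the banned category "Countries", the path Coffee, Brazil, Football, Nobel Prize reaches the target but counts as a failure because it passes through Brazil, while Coffee, Caffeine, Chemistry, Nobel Prize succeeds. This is the distinction that final-answer accuracy alone would miss.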
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 111