Sudoku-Bench: Evaluating creative reasoning with Sudoku variants

Jeffrey Seely; Yuki Imajuku; Tianyu Zhao; Edoardo Cetin; Llion Jones

11 May 2025 (modified: 30 Oct 2025). Submitted to NeurIPS 2025 Datasets and Benchmarks Track. License: CC BY 4.0

Keywords: Sudoku, creative reasoning, logical deduction, long-horizon planning, puzzle benchmarks

TL;DR: We introduce Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning.

Abstract: Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.

Croissant File: json

Dataset URL: https://huggingface.co/datasets/SakanaAI/Sudoku-Bench

Code URL: https://github.com/SakanaAI/Sudoku-Bench

Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling

Submission Number: 1528
