Abstract: We investigate the logical reasoning capabilities of Large Language Models (LLMs) and their scalability across complex deductive tasks. Using ZebraLogic, a newly developed benchmark dataset of logic grid puzzles derived from constraint satisfaction problems (CSPs), we systematically evaluate LLM performance. ZebraLogic spans a broad range of search space complexities and incorporates diverse logical constraints, providing a controlled environment to assess reasoning abilities. Our results reveal a significant decline in accuracy as problem complexity increases—a phenomenon we term the “curse of complexity.” Notably, this limitation persists even when scaling model size and inference-time computation, suggesting fundamental constraints in current LLM reasoning capabilities. Additionally, we explore strategies such as Best-of-N sampling, backtracking mechanisms, and self-verification prompts to enhance logical reasoning performance. Our findings provide critical insights into the scaling behavior of LLMs, highlight their limitations, and outline potential directions for advancing their reasoning capabilities.
Lay Summary: Large language models (LLMs), like those powering chatbots, are great at many tasks, but can they solve complex logic puzzles? We created ZebraLogic, a set of 1,000 logic grid puzzles, to test how well these models handle pure logical reasoning, similar to solving a brain teaser about who lives in which house with specific clues. Our puzzles range from simple to extremely challenging, allowing us to see how model performance changes as difficulty grows. We found that even the biggest and most advanced models struggle when puzzles get very complex—a problem we call the “curse of complexity.” Simply making models larger or giving them more tries doesn’t fully solve this. However, models that “think” step-by-step, using a process called chain-of-thought, perform better. Our work shows that teaching AI to reason more like humans, with careful backtracking, could improve their ability to tackle tough logic problems, benefiting real-world tasks like planning and scheduling.
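To make the puzzle type concrete, here is a minimal illustrative sketch of a logic grid puzzle cast as a CSP and solved by exhaustive search over attribute permutations. The 3-house instance, its clues, and the `solve` helper are invented for illustration and are not taken from the ZebraLogic benchmark or its generator.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("dog", "cat", "fish")

def solve():
    # Try every assignment of colors and pets to houses 0..2 and
    # return the first one satisfying all three (invented) clues.
    for colors in permutations(COLORS):
        for pets in permutations(PETS):
            house_of = {name: i for i, name in enumerate(colors)}
            pet_at = {i: p for i, p in enumerate(pets)}
            if (pet_at[house_of["red"]] == "dog"                  # clue 1: red house keeps the dog
                    and house_of["green"] + 1 == house_of["blue"]  # clue 2: green is just left of blue
                    and pet_at[0] == "cat"):                       # clue 3: the cat lives in house 1
                return colors, pets
    return None

print(solve())  # unique solution: ('green', 'blue', 'red'), ('cat', 'fish', 'dog')
```

Real ZebraLogic instances scale this search space up sharply (more houses and attribute categories), which is what drives the complexity-dependent accuracy drop described above; a dedicated CSP solver would prune with backtracking rather than enumerate all permutations.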
Link To Code: https://huggingface.co/spaces/WildEval/ZebraLogic
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, Reasoning, Scaling
Submission Number: 9246