TL;DR: We theoretically show that Decoder-only Transformers can solve bounded 3-SAT instances with Chain-of-Thought
Abstract: We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the Boolean satisfiability (SAT) problem.
First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT).
Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties.
Third, rather than \textit{programming} a Transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') generated by our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on the problem sizes seen during training but have limited length generalization, which is consistent with the implications of our theoretical result.
Lay Summary: Can Large Language Models Really “Think” Logically?
Problem. Transformer language models can write fluent text, yet we still don’t know if they can follow a formal line of reasoning. That gap matters whenever we want AI systems that explain or verify their own steps.
Approach. We focused on a classic logic puzzle called 3-SAT, where one must set every variable true or false so the whole formula holds. First, we proved (on paper) that a decoder-only Transformer with built-in “chain-of-thought” prompts can solve any 3-SAT instance up to a chosen size by guessing, deducing, and backtracking—exactly like a human doing trial-and-error. Next, we compiled those proof constructions into real model weights with our tool PARAT, and the resulting network solved every test case. Finally, we asked a standard Transformer to learn that reasoning just from example traces.
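To make the “guess, deduce, backtrack” loop concrete, here is a minimal Python sketch of a DPLL-style search that logs each step as a reasoning trace. This is purely illustrative: the function name, trace format, and clause encoding are assumptions for exposition, not the paper's Transformer construction or the PARAT compiler.

# Hypothetical illustration only: a DPLL-style "guess, deduce, backtrack" search
# for 3-SAT that records each step as a reasoning trace.
def solve_3sat(clauses, n_vars, assignment=None, trace=None):
    """Clauses are lists of nonzero ints; literal k means x_k, -k means NOT x_k."""
    if assignment is None:
        assignment, trace = {}, []

    # Deduce (unit propagation): if all but one literal in a clause are false,
    # the remaining literal is forced; a fully falsified clause is a conflict.
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue  # clause already satisfied
            free = [l for l in clause if abs(l) not in assignment]
            if not free:
                trace.append("conflict -> backtrack")
                return None, trace
            if len(free) == 1:
                lit = free[0]
                assignment[abs(lit)] = lit > 0
                trace.append(f"deduce x{abs(lit)}={lit > 0}")
                changed = True

    if len(assignment) == n_vars:  # every variable set without conflict
        return assignment, trace

    # Guess (decision): branch on the first unassigned variable, try both values.
    var = next(v for v in range(1, n_vars + 1) if v not in assignment)
    for value in (True, False):
        trace.append(f"guess x{var}={value}")
        result, trace = solve_3sat(clauses, n_vars, {**assignment, var: value}, trace)
        if result is not None:
            return result, trace
    return None, trace

# Tiny example: (x1 v x2 v x3) & (~x1 v x2 v x3) & (x1 v ~x2 v ~x3)
model, steps = solve_3sat([[1, 2, 3], [-1, 2, 3], [1, -2, -3]], 3)
print(model)  # -> {1: True, 2: True, 3: True} for this formula
print(steps)  # the guess/deduce/backtrack trace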
Discovery. The trained model handled fresh logical puzzles of the same length, confirming that Transformers possess the inherent capability for logical deduction, but it performed unreliably on larger puzzles. Closing that “length-generalization” gap is the next step toward safer, provably reliable AI.
Link To Code: N/A
Primary Area: Theory->Deep Learning
Keywords: Transformers, Logical Reasoning, SAT-Solving
Submission Number: 7696