Beyond Clause Count: A Study of Proof-Relevant Difficulty in LLM SAT Reasoning

Tao Jiang; Shaowei Cai

Beyond Clause Count: A Study of Proof-Relevant Difficulty in LLM SAT Reasoning

Tao Jiang, Shaowei Cai

Published: 01 Apr 2026, Last Modified: 17 Apr 2026ICLR 2026 Workshop LLM Reasoning OralEveryoneRevisionsBibTeXCC BY 4.0

Track: long paper (up to 10 pages)

Keywords: LLM reasoning, SAT, UNSAT, proof complexity, Tseitin formulas, benchmark, logical reasoning

TL;DR: We show that for LLMs judging SAT/UNSAT from raw CNF, clause count alone is not enough: clause-controlled Tseitin formulas with different graph structures can induce very different UNSAT difficulty.

Abstract: SAT has recently emerged as a controlled setting for evaluating logical reasoning in large language models (LLMs). However, existing SAT-oriented benchmarks mainly vary surface-level properties such as formula size (e.g., clause count and variable count), phase-transition regime, or question format, but these axes do not directly capture refutational difficulty on UNSAT instances. We propose proof complexity as a theory-grounded lens for diagnosing LLM failures on UNSAT reasoning and instantiate this perspective in a preliminary study using clause-controlled Tseitin formulas together with two auxiliary control families, implication-chain formulas and random 3-CNF instances. Using direct CNF input, we find that aggregate accuracy often hides strong label-specific bias, and that structurally different Tseitin families can remain substantially different in difficulty even when clause budgets are matched. These results suggest that surface statistics such as clause count and variable count are insufficient for understanding LLM performance on satisfiability judgments, and motivate broader evaluations of proof-relevant hardness in formal reasoning benchmarks.

Presenter: ~Tao_Jiang23

Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Submission Number: 202

Loading