Keywords: Natural Language Reasoning, Dataset Generator, LLM Reasoning, Boolean Satisfiability (SAT), SAT Reasoning with LLMs, Propositional-Logic-Based Text Generator
TL;DR: A variable-complexity dataset generator for natural-language SAT tasks, used to evaluate how well LLMs/LRMs determine logical consistency. Results indicate that some models are biased toward predicting an inconsistent set of clauses as consistent.
Abstract: A key challenge for LLMs lies in their ability to reason. To evaluate an LLM's reasoning capabilities, there is a need for challenging natural language datasets for reasoning tasks. However, it is hard to manually generate these datasets across all domains, at the scale required by LLMs, as they demand costly effort from subject matter experts. Moreover, as datasets become public, they become part of the training data of LLMs, creating a continual need for newer datasets. In this work, we formalize the problem of synthetically generating natural-language SAT reasoning tasks of variable complexity that adhere to propositional logic. We then present our method, LTGEN (Logical Text Generator), to generate custom datasets aligned with our formalism. We test $\texttt{GPT-4o}$ and $\texttt{o3-mini}$ on two datasets auto-generated with LTGEN and find that LLMs struggle particularly on hard UNSAT problems and are biased towards predicting that the text is logically consistent.
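As a rough illustration of the kind of generation the abstract describes (the function names and verbalization template below are hypothetical sketches, not LTGEN's actual implementation), one can sample a random k-SAT formula, label it SAT/UNSAT by exhaustive search over small instances, and render each clause as an English sentence:

```python
import itertools
import random

# Minimal sketch, assuming a naive clause-by-clause verbalization.
# This is NOT the paper's method; it only illustrates the task format.

def random_ksat(num_vars, num_clauses, k=3, seed=0):
    """Sample clauses; each clause is a list of signed variable ids."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        chosen = rng.sample(range(1, num_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in chosen])
    return clauses

def is_satisfiable(clauses, num_vars):
    """Brute-force SAT check; fine for the tiny instances used here."""
    for bits in itertools.product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause)
               for clause in clauses):
            return True
    return False

def verbalize(clauses):
    """Render each clause as a plain-English disjunction over named atoms."""
    names = "ABCDEFGHIJ"
    def atom(l):
        name = names[abs(l) - 1]
        return f"{name} is true" if l > 0 else f"{name} is false"
    return ". ".join("At least one of the following holds: "
                     + " or ".join(atom(l) for l in clause)
                     for clause in clauses) + "."

clauses = random_ksat(num_vars=5, num_clauses=12)
label = "consistent (SAT)" if is_satisfiable(clauses, 5) else "inconsistent (UNSAT)"
print(verbalize(clauses))
print("Ground-truth label:", label)
```

The model under evaluation would receive only the verbalized text and be asked whether the described statements are logically consistent; the brute-force label serves as ground truth.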
Submission Number: 83