Rethinking logic in AI: A novel benchmark inspired by the polynomial analogue of Gandy's fixed point theorem

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: llm, logic, benchmark, Gandy's fixed point theorem
TL;DR: We introduce a benchmark revealing that even advanced LLMs like GPT-4 struggle with basic logical reasoning tasks
Abstract: This paper introduces a novel benchmark for evaluating the logical reasoning capabilities of Large Language Models (LLMs), grounded in the polynomial analogue of Gandy's classical fixed point theorem. Because this theorem characterizes the P-complete HornSAT problem, a benchmark built on it effectively covers the class P, and our results show that serious difficulties for LLMs already arise at this level, well below the NP-complete and NP-hard problems targeted by other benchmarks. Drawing on concepts from mathematical logic, we design a parameterized family of recursively definable problems in which the LLM must decide whether a given instance belongs to an inductively definable set of polynomial complexity. By varying the parameters, we generate problem instances of differing complexity. Our experiments reveal that current state-of-the-art LLMs with zero-shot prompts fail to reliably solve even the most straightforward cases, despite the existence of an effective deterministic algorithm. Even advanced models such as GPT-4 exhibit significant biases when solving the benchmark problems. These findings highlight the limitations of modern LLMs as code interpreters, even in basic scenarios, and underscore the necessity for hybrid LLM/interpreter systems. They also emphasize the importance of developing quantitative tests of reasoning, given the increasing reliance on LLM-based systems in decision-making applications.
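For context, the effective deterministic algorithm alluded to in the abstract is, in essence, the polynomial-time computation of a least fixed point: HornSAT is decided by forward chaining (unit propagation), repeatedly applying the one-step derivability operator until no new facts appear. The sketch below is our own minimal illustration of this idea, not the authors' benchmark code; the clause encoding (a pair of body variables and an optional head) is an assumption made for the example.

```python
# Minimal sketch (an assumed encoding, not the paper's code): HornSAT via
# forward chaining, i.e., computing the least fixed point of the one-step
# derivability operator. Runs in polynomial time in the formula size.

def horn_sat(clauses):
    """Decide satisfiability of a Horn formula.

    Each clause is a pair (body, head):
      - body: frozenset of variables that must all be true
      - head: a variable forced true, or None for a goal clause (head = false)
    """
    true_vars = set()            # the inductively defined set of derived facts
    changed = True
    while changed:               # iterate until the least fixed point is reached
        changed = False
        for body, head in clauses:
            if body <= true_vars:            # all premises already derived
                if head is None:             # body holds but head is false
                    return False             # contradiction: unsatisfiable
                if head not in true_vars:
                    true_vars.add(head)      # derive a new fact
                    changed = True
    return True                  # fixed point reached without contradiction


if __name__ == "__main__":
    # (x) and (x -> y) and (x & y -> z): satisfiable, derives {x, y, z}
    sat = [(frozenset(), "x"),
           (frozenset({"x"}), "y"),
           (frozenset({"x", "y"}), "z")]
    # adding the goal clause (z -> false) makes the formula unsatisfiable
    unsat = sat + [(frozenset({"z"}), None)]
    print(horn_sat(sat))    # True
    print(horn_sat(unsat))  # False
```

The while-loop is exactly the fixed-point iteration the theorem guarantees terminates in polynomially many steps, which is what makes the contrast with LLM failure on such instances notable.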
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6583