FLoRE: A Formal Language Benchmark for Logical Reasoning Evaluation

ACL ARR 2025 May Submission6523 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Logical reasoning is a fundamental capability for advanced artificial intelligence systems. As Large Language Models (LLMs) continue to improve, many studies have sought to provide an accurate evaluation of LLMs' logical reasoning capabilities. However, current benchmarks suffer from issues including interference from commonsense knowledge, short reasoning paths, and low scalability. In this work, we propose an automated, cost-efficient method for generating datasets, and FLoRE, a novel benchmark that uses formal languages, a purely symbolic reasoning system, to evaluate the logical reasoning abilities of LLMs. Experimental results indicate that current large models generally perform poorly on logical reasoning tasks and are sensitive to the symbolic meanings involved in the reasoning process. This benchmark leverages the characteristics of symbolic systems to avoid interference from commonsense knowledge while maintaining the difficulty of reasoning tasks and reducing the complexity of data construction. All data and code will be available online.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: logical reasoning, interpretability
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 6523