FLoRE: A Formal Language Benchmark for Logical Reasoning Evaluation

ACL ARR 2025 May Submission6523 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Logical reasoning is a fundamental capability for advanced artificial intelligence systems. As Large Language Models (LLMs) continue to improve, many studies have sought to provide an accurate evaluation of LLMs' logical reasoning capabilities. However, current benchmarks suffer from issues including interference from commonsense knowledge, short reasoning paths, and low scalability. In this work, we propose an automated, cost-efficient method for generating datasets, and FLoRE, a novel benchmark that uses formal languages, a purely symbolic reasoning system, to evaluate the logical reasoning abilities of LLMs. Experimental results indicate that current large models generally perform poorly on logical reasoning tasks and are sensitive to the symbolic meanings involved in the reasoning process. This benchmark leverages the characteristics of symbolic systems to avoid interference from commonsense knowledge while maintaining the difficulty of reasoning tasks and reducing the complexity of data construction. All data and code will be available online.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: logical reasoning, interpretability
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 6523