Keywords: reasoning, logical reasoning, large language models, evaluation, benchmark
TL;DR: RelEval is a benchmark for evaluating large language models (LLMs) on logical planning and reasoning over dynamically generated, complex relational structures.
Abstract: We introduce RelEval, a benchmark for evaluating large language models (LLMs) in logical reasoning over complex relational structures. Such reasoning underpins applications where LLMs generate or query structured graphs, including network infrastructure, knowledge bases, and business process schemas. Our framework enables fine-grained control of task difficulty by varying the number of objects,
relations, and the depth of relational chains. RelEval encompasses three complementary tasks: (1) Plan Generation, requiring construction of valid directed relational graphs under structural constraints; (2) Consistency Detection, identifying inconsistencies in given relational structures; and (3) Comparison Question, assessing the validity of queried relationships. We also test models’ self-correction by prompting them to verify and refine their answers. We evaluate DeepSeek-R1, Gemini 2.0 Pro, Gemini 2.0 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, o3-mini, o1, and Claude 3.7 Sonnet, finding large performance gaps linked to model scale and architecture. While recent reasoning-focused models excel on simpler cases, they struggle with more complex configurations requiring deeper reasoning.
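To make the difficulty controls concrete, below is a minimal sketch, assuming a hypothetical parameterization (the names `TaskConfig`, `generate_instance`, and `longest_chain` are illustrative and not part of the RelEval release), of how a task instance could be generated by varying the number of objects, the number of relations, and the depth of relational chains.

```python
# Hypothetical sketch, not the authors' implementation: parameterizing a
# RelEval-style instance by objects, relations, and relational chain depth.
import random
from dataclasses import dataclass

@dataclass
class TaskConfig:
    num_objects: int      # nodes in the relational graph
    num_relations: int    # directed edges to sample
    max_chain_depth: int  # longest allowed relational chain (in edges)

def generate_instance(cfg: TaskConfig, seed: int = 0) -> list[tuple[int, int]]:
    """Sample a random directed acyclic relational structure."""
    rng = random.Random(seed)
    edges: set[tuple[int, int]] = set()
    while len(edges) < cfg.num_relations:
        a, b = rng.sample(range(cfg.num_objects), 2)
        if a > b:              # orient edges by node index to guarantee acyclicity
            a, b = b, a
        edges.add((a, b))
    return sorted(edges)

def longest_chain(edges: list[tuple[int, int]], num_objects: int) -> int:
    """Length of the longest relational chain in the DAG (nodes are index-ordered)."""
    depth = [0] * num_objects
    for a, b in sorted(edges):  # sources are processed before their successors
        depth[b] = max(depth[b], depth[a] + 1)
    return max(depth, default=0)

cfg = TaskConfig(num_objects=6, num_relations=7, max_chain_depth=3)
edges = generate_instance(cfg)
print(edges, "within depth limit:", longest_chain(edges, cfg.num_objects) <= cfg.max_chain_depth)
```

Increasing `num_objects`, `num_relations`, or `max_chain_depth` would yield progressively harder instances in this sketch, mirroring the fine-grained difficulty control described in the abstract.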
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16014