Keywords: reasoning, logical reasoning, large language models, evaluation, benchmark
TL;DR: RelEval is a benchmark for evaluating large language models (LLMs) on logical planning and reasoning over dynamically generated, complex relational structures.
Abstract: We introduce RelEval, a benchmark for evaluating large language models (LLMs) in logical reasoning over complex relational structures. Such reasoning underpins applications where LLMs generate or query structured graphs, including network infrastructure, knowledge bases, and business process schemas. Our framework enables fine-grained control of task difficulty by varying the number of objects,
relations, and the depth of relational chains. RelEval encompasses three complementary tasks: (1) Plan Generation, requiring construction of valid directed relational graphs under structural constraints; (2) Consistency Detection, identifying inconsistencies in given relational structures; and (3) Comparison Question, assessing the validity of queried relationships. We also test models’ self-correction by prompting them to verify and refine their answers. We evaluate DeepSeek-R1, Gemini 2.0 Pro, Gemini 2.0 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, o3-mini, o1, and Claude 3.7 Sonnet, finding large performance gaps linked to model scale and architecture. While recent reasoning-focused models excel on simpler cases, they struggle with more complex configurations requiring deeper reasoning.
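To make the difficulty controls concrete, below is a minimal sketch, assuming a hypothetical parameterization (the names `TaskConfig`, `generate_instance`, and `longest_chain` are illustrative and not part of the RelEval release), of how a task instance could be generated by varying the number of objects, the number of relations, and the depth of relational chains.

```python
# Hypothetical sketch, not the authors' implementation: parameterizing a
# RelEval-style instance by objects, relations, and relational chain depth.
import random
from dataclasses import dataclass

@dataclass
class TaskConfig:
    num_objects: int      # nodes in the relational graph
    num_relations: int    # directed edges to sample
    max_chain_depth: int  # longest allowed relational chain (in edges)

def generate_instance(cfg: TaskConfig, seed: int = 0) -> list[tuple[int, int]]:
    """Sample a random directed acyclic relational structure."""
    rng = random.Random(seed)
    edges: set[tuple[int, int]] = set()
    while len(edges) < cfg.num_relations:
        a, b = rng.sample(range(cfg.num_objects), 2)
        if a > b:              # orient edges by node index to guarantee acyclicity
            a, b = b, a
        edges.add((a, b))
    return sorted(edges)

def longest_chain(edges: list[tuple[int, int]], num_objects: int) -> int:
    """Length of the longest relational chain in the DAG (nodes are index-ordered)."""
    depth = [0] * num_objects
    for a, b in sorted(edges):  # sources are processed before their successors
        depth[b] = max(depth[b], depth[a] + 1)
    return max(depth, default=0)

cfg = TaskConfig(num_objects=6, num_relations=7, max_chain_depth=3)
edges = generate_instance(cfg)
print(edges, "within depth limit:", longest_chain(edges, cfg.num_objects) <= cfg.max_chain_depth)
```

Increasing `num_objects`, `num_relations`, or `max_chain_depth` would yield progressively harder instances in this sketch, mirroring the fine-grained difficulty control described in the abstract.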
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16014