RuleArena: A Benchmark for LLM Rule-Guided Reasoning in Real-World Scenarios

Published: 05 Mar 2025, Last Modified: 19 Mar 2025
Venue: Reasoning and Planning for LLMs @ ICLR2025
License: CC BY 4.0
Keywords: Large Language Models, Complex Reasoning, Rule Following
Abstract: This paper introduces RuleArena, a challenging benchmark for evaluating the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains (airline baggage fees, NBA transactions, and tax regulations), RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently confusing similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly on the benchmark. We also observe a significant performance boost when LLMs are provided with external tools. These results highlight substantial challenges and promising directions for advancing LLMs' rule-guided reasoning capabilities in real-life applications.
Submission Number: 99