FIMD Reasoning Evaluation Framework: Detecting the Deficiencies in Complex Legal Reasoning of Large Language Models

ACL ARR 2025 February Submission 8229 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models (LLMs) must demonstrate human-level reasoning capabilities to support broad adoption of machine learning in the legal industry. Toward this goal, we introduce TortBench, a legal reasoning dataset comprising court judgments on tort cases annotated by a legal expert with summaries of legal explanations. We show how to formulate natural language explanation tasks that support the adoption of LLMs in the legal domain. We evaluate the reasoning capabilities of state-of-the-art LLMs and report the most frequent failure modes in their legal reasoning. We further introduce a novel framework for detecting limitations in LLM legal reasoning, flagging critical errors that may lead to harmful consequences, together with novel metrics for benchmarking reasoning capabilities. Our framework provides a foundation for future benchmarking and the continued improvement of legal reasoning in LLMs.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: legal NLP
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 8229