Keywords: Large Language Models, Legal AI, Judicial Reasoning, Case-level Reasoning, Chinese Legal System
Abstract: While Large Language Models (LLMs) have achieved high accuracy on isolated legal QA and "exam-style" benchmarks, their reliability in handling the interdependent, procedural workflows of real-world professional legal practice remains largely unproven. To address this gap, we introduce JurisBench, a vertical, depth-oriented benchmark designed to evaluate legal LLMs across the full lifecycle of Chinese civil litigation. JurisBench introduces a Linear Depth Simulation track that mirrors the cognitive workflow of professional judges through four sequential, dependency-aware phases: Cause of Action prediction, Focus of Disputes prediction, Rationale of the Judgment prediction, and Result of the Judgment prediction. Experimental results from state-of-the-art LLMs reveal a stark "illusion of competence": while models excel in isolated generative tasks, their performance collapses in an end-to-end pipeline due to substantial error propagation. We identify precise statutory grounding as a persistent bottleneck, highlighting a critical gap between fluent linguistic output and practical judicial reliability. JurisBench thus provides a principled diagnostic framework and testbed for developing more robust, workflow-aware legal AI capable of professional-grade adjudication.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, legal NLP, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 883