Keywords: Large Language Models, Legal AI, Judicial Reasoning, Case-level Reasoning, Chinese Legal System
Abstract: While Large Language Models (LLMs) have achieved high accuracy on isolated legal QA and "exam-style" benchmarks, their reliability in handling the interdependent, procedural workflows of real-world professional legal practice remains largely unproven. To address this gap, we introduce JurisBench, a vertical, depth-oriented benchmark designed to evaluate legal LLMs across the full lifecycle of Chinese civil litigation. JurisBench introduces a Linear Depth Simulation track that mirrors the cognitive workflow of professional judges through four sequential, dependency-aware phases: Cause of Action prediction, Focus of Disputes prediction, Rationale of the Judgment prediction, and Result of the Judgment prediction. Experimental results from state-of-the-art LLMs reveal a stark "illusion of competence": while models excel in isolated generative tasks, their performance collapses in an end-to-end pipeline due to substantial error propagation. We identify precise statutory grounding as a persistent bottleneck, highlighting a critical gap between fluent linguistic output and practical judicial reliability. JurisBench thus provides a principled diagnostic framework and testbed for developing more robust, workflow-aware legal AI capable of professional-grade adjudication.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, legal NLP, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 883