Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

ACL ARR 2025 May Submission 6958 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Test-time scaling large language models (LLMs), such as DeepSeek-R1 and OpenAI's o1, enhance reasoning by extending inference-time chain-of-thought traces. However, their legal reasoning capabilities remain underexplored. We conduct the first systematic evaluation of 10 LLMs --- including both reasoning and general-purpose models --- across 17 Chinese and English legal benchmarks covering statutory and case-law traditions. To bridge the domain gap, we curate a chain-of-thought-annotated legal corpus and train Legal-R1-14B, an open-source legal specialist model. Legal-R1-14B outperforms both o1-preview and DeepSeek-R1 on several benchmarks, establishing a new baseline for legal reasoning. Error analysis reveals persistent challenges, including outdated legal knowledge, reasoning failures, and factual hallucinations, highlighting key directions for future work on legal-domain LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation; NLP datasets; benchmarking
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Data analysis
Languages Studied: English, Chinese
Keywords: Legal Capability Evaluation; Legal Reasoning; Legal Benchmarks; Test-time Scaling
Submission Number: 6958