LegalAgentBench: Evaluating LLM Agents in Legal Domain

THU 2024 Winter AML Submission 26 Authors

11 Dec 2024 (modified: 02 Mar 2025) · THU 2024 Winter AML Submission · CC BY 4.0
Keywords: Large Language Model, Agent, Benchmark, Legal
Abstract: As LLM agents grow in intelligence and autonomy, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks cannot fully capture the complexity and subtle nuances of real-world judicial cognition and decision-making. Therefore, we propose **LegalAgentBench**, a comprehensive benchmark specifically designed to evaluate LLM agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi-hop reasoning and writing, and cover a range of difficulty levels, effectively reflecting the complexity of real-world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis of intermediate processes to calculate progress rates, enabling more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at [https://github.com/cjj826/LegalAgentBench](https://github.com/cjj826/LegalAgentBench).
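To make the progress-rate idea concrete, here is a minimal sketch of how a keyword-based progress metric could be computed; the function name, keyword format, matching rule, and example trace are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import List

def progress_rate(intermediate_outputs: List[str], gold_keywords: List[str]) -> float:
    """Illustrative keyword-based progress metric (an assumed form, not
    necessarily LegalAgentBench's exact definition): the fraction of
    annotated gold keywords that appear anywhere in the agent's
    intermediate outputs."""
    if not gold_keywords:
        return 0.0
    combined = " ".join(intermediate_outputs)
    matched = sum(1 for kw in gold_keywords if kw in combined)
    return matched / len(gold_keywords)

# Hypothetical usage: an agent's intermediate tool-call trace for a multi-hop legal task.
trace = ["查询公司：某某科技有限公司", "法定代表人：张三", "注册资本：5000万元"]
keywords = ["张三", "5000万元", "成立日期"]
print(progress_rate(trace, keywords))  # 2 of 3 gold keywords matched -> 0.666...
```

Under this reading, a partially successful trajectory still earns credit for each annotated keyword it surfaces, which is what allows the evaluation to be more fine-grained than a binary success flag.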
Submission Number: 26