Keywords: Benchmark, law, ToM
Abstract: Although large language models (LLMs) excel at legal document retrieval and case summarization, their strategic theory-of-mind (ToM) reasoning in legal tasks remains largely uninvestigated. This paper examines LLMs acting in different roles in U.S. Supreme Court oral arguments, where \textit{Justices} challenge \textit{Advocates} through adversarial and strategically framed questioning before rendering final decisions.
We introduce \ours, the first benchmark designed to evaluate LLMs' ToM reasoning in legal contexts, using U.S. Supreme Court oral arguments as a natural testbed.
In \ours, a \textit{Justice agent} predicts future lines of inquiry and produces legally coherent, strategically structured questions. An \textit{Advocate agent} interprets judicial intentions and generates persuasive responses that adapt to the evolving context.
We evaluate them on four tasks: understanding justices' intentions, predicting justices' subsequent questions, and generating contextually adaptive responses for each of the two roles (justices and advocates).
Evaluations of a diverse range of open-source and proprietary LLMs show that while larger models and extended context windows yield consistent improvements, even the strongest systems fall short of expert human performance.
\ours provides a rigorous testbed for exploring the emergence and boundaries of multi-agent ToM reasoning and human-AI interactions in professional legal processes.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 7085