Keywords: Benchmark, law, ToM
Abstract: Although large language models (LLMs) excel at legal document retrieval and case summarization, their strategic theory-of-mind (ToM) reasoning in legal tasks remains largely uninvestigated. This paper examines LLMs acting in different roles in U.S. Supreme Court oral arguments, where \textit{Justices} challenge \textit{Advocates} through adversarial and strategically framed questioning before rendering final decisions.
We introduce \ours, the first benchmark designed to evaluate LLMs' ToM reasoning in legal contexts, using U.S. Supreme Court oral arguments as a natural testbed.
In \ours, a \textit{Justice agent} predicts future lines of inquiry and produces legally coherent, strategically structured questions. An \textit{Advocate agent} interprets judicial intentions and generates persuasive responses that adapt to the evolving context.
We evaluate them on four tasks: understanding justices' intentions, predicting justices' subsequent questions, and generating contextually adaptive responses for each of the two roles (justices and advocates).
Evaluations of a diverse range of open-source and proprietary LLMs show that while larger models and extended context windows yield consistent improvements, even the strongest systems fall short of expert human performance.
\ours provides a rigorous testbed for exploring the emergence and boundaries of multi-agent ToM reasoning and human-AI interactions in professional legal processes.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 7085