Keywords: Legal summarization, Long-context evaluation, Checklist-based evaluation, Agent scaffold
TL;DR: We present Gavel-Ref for comprehensive legal summarization evaluation, benchmark 12 frontier LLMs, and explore direct checklist extraction via end-to-end and chunk-by-chunk methods as well as our agent scaffold, Gavel-Agent.
Abstract: Large language models (LLMs) are increasingly applied in legal practice, with case summarization being a key long-context task in which cases often exceed 100K tokens across multiple documents. Existing evaluation methods rely on checklist comparisons but use coarse-grained extraction that merges multiple values into single text blocks, missing partial matches during comparison. They also overlook content beyond predefined checklist categories and lack writing-style evaluation. In this paper, we introduce Gavel-Ref, a reference-based evaluation framework that improves checklist evaluation through multi-value extraction with supporting text, and further incorporates residual-fact and writing-style assessments. Using Gavel-Ref, we move beyond the single aggregate scores reported in prior work to systematically evaluate 12 frontier LLMs on legal cases ranging from 32K to 512K tokens, primarily from 2025. Our detailed analysis reveals that Gemini 2.5 Flash, GPT-5, and Claude Sonnet 4 achieve the best performance (around 50 $S_{\text{Gavel-Ref}}$), underscoring the difficulty of the task. These top models show consistent patterns: they succeed on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs keep improving and may eventually surpass human-written summaries, we also explore checklist extraction directly from case documents. We experiment with three methods: end-to-end extraction with a long-context LLM, chunk-by-chunk extraction, and our newly developed autonomous agent scaffold, Gavel-Agent. Results show a trade-off between performance and efficiency: GPT-4.1 end-to-end performs best, while Gavel-Agent with Qwen3 reduces token usage by about 50\%. We will release our code and annotations publicly to facilitate future research on long-context legal summarization.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14826