Abstract: Official court press releases from Germany’s highest courts are vital for bridging complex judicial rulings and the public. Prior NLP work on German legal text summarization emphasizes technical headnotes, often ignoring the need for citizen-oriented communication. We introduce CourtPressGER, a dataset of 6.4k triples of rulings, their human-drafted press releases, and synthetic contextual prompts that allow LLMs to generate comparable press releases. The resulting benchmark is intended to train and evaluate LLMs in generating accurate, more readable summaries from long judicial texts. We benchmark a set of small and large LLMs on the task and evaluate model outputs via reference-based metrics, factual-consistency checks, and an LLM-as-judge approach that approximates expert review. We further conduct a qualitative expert analysis and ranking. Results show that large LLMs produce near-human-quality drafts and lose only marginal performance when applied hierarchically. Smaller models require a hierarchical setup to summarize long judgments at all and achieve mixed scores. All models struggle with factual consistency, and the human-drafted press release is consistently ranked highest.
Paper Type: Short
Research Area: Summarization
Research Area Keywords: extractive summarization, abstractive summarization, multimodal summarization, long-form summarization, evaluation, factuality
Contribution Types: Data resources, Data analysis
Languages Studied: German
Submission Number: 6652