SummQual: A Dataset of Human Evaluation on Large-Scale Language Model Summarization Quality

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: To measure how well large language models produce summaries, we collect human judgments of summary quality over a large and diverse set of documents.
Abstract: Large Language Models (LLMs) have shown impressive performance on various natural language processing tasks, including text summarization. However, evaluating the quality of summaries generated by LLMs is challenging, as automatic metrics often correlate poorly with human judgments. In this work, we present SummQual, the largest human evaluation dataset for multi-domain summarization to date, featuring 6k document-summary pairs in the test set and 30k training pairs. Our dataset evaluates summaries from several state-of-the-art LLM systems, such as GPT-4, Bard, and Vicuna. Unlike most existing datasets, which focus on the news domain, ours covers diverse domains including Wikipedia, TV news, PubMed, Reddit, YouTube videos, Supreme Court cases, clinical dialogues, and financial reports. To avoid overlap with LLMs' training data, SummQual collects documents from recent public online sources published in 2023 or later. Furthermore, the dataset contains not only standard summary quality annotations, e.g., relevance and coherence, but also fine-grained human feedback on hallucinated spans. We believe SummQual can support a deeper understanding of LLMs' summarization capabilities and promote research in text summarization as well as hallucination detection and mitigation.
Paper Type: long
Research Area: Summarization
Contribution Types: Data resources
Languages Studied: English