REDTABS: A Collection of Report Document Datasets for Long Text and Multi-Table Summarization


16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: Automatic document summarization aims to produce a concise summary that covers the salient content of the input document. Within a report document, both the textual and the non-textual content (e.g., tables and figures) can be important information sources for the summary. However, most available document summarization datasets focus on the text and filter out the non-textual content. Missing tabular data can limit the informativeness of the produced summaries, especially when target summaries must cover quantitative descriptions of critical metrics, whose numerical information is usually kept in tables. In this paper, we address this issue by introducing REDTABS, the first collection of large-scale datasets for long text and multi-table summarization. Built on companies' annual reports, it includes three large-scale datasets for summarizing these companies' business, results of operations, and overall conditions, respectively. We also present the Segment-Alignment-based long Text and multi-Table summarization (SATT) method, which incorporates textual and tabular data into the summarization process. In addition, we propose a set of automatic evaluation metrics to assess the numerical information in summaries produced by summarization models. Dataset analyses and experimental results reveal the importance of incorporating both textual and tabular data into report document summarization. We will release our data and code to facilitate advances in summarization and text generation research.
