NEWSFARM: the Largest Chinese Corpus for Long News Summarization

Anonymous

16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: Driven by a growing number of datasets, the field of natural language processing (NLP) has developed rapidly in recent years. However, the lack of large-scale, high-quality Chinese datasets remains a critical bottleneck for further research on automatic text summarization. To close this gap, we searched Chinese-language news websites of domestic and foreign media and designed the HSS algorithm (hidden text topic, semantic similarity, and syntactic similarity) to crawl and filter their records, constructing NEWSFARM. NEWSFARM is the largest and highest-quality Chinese long news summarization corpus, containing more than 200K Chinese long news articles paired with summaries written by professional editors or authors, all of which are released to the public. Based on the corpus, we computed statistical metrics and ran a range of experiments with baseline models. Compared with commonly used datasets, the experimental results show the high quality of our dataset and its effectiveness for training models, which not only demonstrates the usefulness and challenges of the proposed corpus for automatic text summarization but also provides a benchmark for further research.