CSL: A Large-scale Chinese Scientific Literature Dataset for Cross-task Evaluation

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Scientific literature serves as a high-quality corpus that can provide naturally annotated data for many natural language processing (NLP) tasks. In this work, we introduce a Chinese Scientific Literature dataset, CSL, which contains the titles, abstracts, keywords, and academic fields of 400,000 papers. The rich semantic information in this scientific literature supports a wide range of NLP tasks and provides a natural cross-task setting. Based on this, we present a cross-task few-shot benchmark. To evaluate the cross-task transferability of models, we design scenarios of varying aspects and difficulties. Compared with previous cross-task benchmarks, these tasks are constructed from a homogeneous corpus, allowing researchers to investigate the relationships between tasks without interference from heterogeneous data sources, annotation schemes, and other factors. We analyze the behavior of existing text-to-text models on the proposed benchmark and reveal the challenges of cross-task generalization, providing a valuable reference for future research. Code and data are publicly available at https://github.com/CSL-Dataset/CSL_Dataset.
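For illustration, the sketch below shows one way records with the four fields named in the abstract (title, abstract, keywords, academic field) might be read and turned into inputs for a downstream task. The file layout, file name, and field names here are assumptions for the sake of the example, not the released format, which is documented in the repository linked above.

```python
import json

def load_csl(path):
    """Read CSL-style records from a JSON-lines file (assumed layout)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            paper = json.loads(line)
            records.append({
                "title": paper["title"],        # paper title
                "abstract": paper["abstract"],  # paper abstract
                "keywords": paper["keywords"],  # list of keywords
                "field": paper["field"],        # academic field / discipline label
            })
    return records

# Example: build (title, field) pairs for a simple text-classification task.
# papers = load_csl("csl_data.jsonl")   # hypothetical file name
# pairs = [(p["title"], p["field"]) for p in papers]
```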