Abstract: Modeling large contexts, especially linguistic phenomena that extend beyond individual sentences, is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on intra-sentence properties and overlook critical discourse phenomena that cross sentences. To bridge this gap, we propose Disco-Bench, a benchmark that can evaluate inter-sentence contextual properties across a diverse set of NLP tasks, covering understanding, translation, and generation. Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena (e.g., cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite to probe the extent to which the evaluated models have internalized contextual information. In total, we evaluate 20 general-purpose and domain-specific models based on advanced pretraining architectures and large language models (LLMs). Our results show that (1) our evaluation benchmark is both challenging and necessary, and (2) fine-grained pretraining with literary document-level training data consistently enhances the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope will significantly facilitate research in this field.
Paper Type: long
Research Area: Special Theme (conference specific)
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Data resources, Position papers
Languages Studied: English, Chinese