Disco-Bench: A Context-Aware Evaluation Benchmark for Language Modelling

Longyue Wang; DongHuai Liu; Deng Cai; Dian Yu; Haiyun Jiang; Yan Wang; Leyang Cui; Shuming Shi; Zhaopeng Tu

22 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024

Primary Area: datasets and benchmarks

Keywords: Language Modelling, Benchmark, Language Understanding, Language Translation, Language Generation, Large Language Model

Abstract: Modeling large contexts, especially linguistic phenomena that span beyond individual sentences, is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on intra-sentence properties and overlook critical discourse phenomena that cross sentence boundaries. To bridge this gap, we propose Disco-Bench, a benchmark that evaluates inter-sentence contextual properties across a diverse set of NLP tasks covering understanding, translation, and generation. Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena (e.g., cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite that probes the extent to which the evaluated models have internalized contextual information. In total, we evaluate 20 general-purpose and domain-specific models based on advanced pretraining architectures and large language models (LLMs). Our results show that (1) our evaluation benchmark is both challenging and necessary; and (2) fine-grained pretraining on literary document-level training data consistently enhances the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope will significantly facilitate research in this field.

Submission Number: 5265
