TGEA 2.0: A Large-Scale Diagnostically Annotated Dataset with Benchmark Tasks for Text Generation of Pretrained Language Models

06 Jun 2022, 11:59 (modified: 13 Oct 2022, 08:41)
NeurIPS 2022 Datasets and Benchmarks
Readers: Everyone
Keywords: Text Generation, Pretrained Language Model, Data Curation, Text Generation Error
TL;DR: A large dataset and benchmark tasks used to diagnostically analyze and improve the capability of pretrained language models in text generation
Abstract: To diagnostically analyze and improve the text generation capability of pretrained language models (PLMs), we propose TGEA 2.0, to date the largest dataset built on machine-authored texts by PLMs, with fine-grained semantic annotations on a wide variety of pathological generation errors. We collect 170K nominal, phrasal and sentential prompts from 6M natural sentences in 3 domains. These prompts are fed into 4 generative PLMs, each with its best decoding strategy, to generate paragraphs. From these generated paragraphs, 195,629 sentences are extracted for manual annotation, in which 36K erroneous sentences are detected and 42K erroneous spans are located, each categorized into an error type defined in a two-level error taxonomy. We define a Minimal Set of Error-related Words (MiSEW) for each erroneous span, which not only provides error-associated words but also rationalizes the reasoning behind the error. Quality control with a pre-annotation and feedback loop is performed before and during the entire annotation process. With the diagnostically annotated dataset, we propose 5 diagnosis benchmark tasks (i.e., erroneous text detection, MiSEW extraction, erroneous span location and correction, together with error type classification) and 2 pathology mitigation benchmark tasks (pairwise comparison and word prediction). Experimental results on these benchmark tasks demonstrate that TGEA 2.0 is a challenging dataset that could facilitate further research on automatic diagnosis and pathology mitigation over machine texts. The dataset will be publicly available at
Supplementary Material: pdf
Dataset Url: The dataset will be released at
Dataset Embargo: The full training dataset and a small dev/test dataset will be released at the project websites by July 2022. The full dev/test datasets will be made available when we organize the shared task with the dataset.
License: CC BY-SA 4.0
Author Statement: Yes
Contribution Process Agreement: Yes
In Person Attendance: No