Can Pre-trained Models Really Generate Single-Step Textual Entailment?

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: We investigate the task of generating textual entailment (GTE). Unlike prior work on recognizing textual entailment (RTE), also known as NLI, GTE requires deeper reasoning capabilities: a model must generate an entailment from the premises rather than classify a given premise-hypothesis pair. We argue that existing adapted datasets are limited and inadequate for training and evaluating human-like reasoning in GTE. In this paper, we propose a new large-scale benchmark, named \mydataset, for learning and evaluating models' GTE capabilities. \mydataset consists of 15k instances, each containing a pair of premise statements and a human-annotated entailment. It is constructed by first retrieving instances from a knowledge base and then augmenting each instance with several complementary instances via 7 manually crafted transformations. We demonstrate that even extensively fine-tuned pre-trained models perform poorly on \mydataset: the best generator models produce a valid textual entailment only 59.1\% of the time. Further, to motivate future advances, we provide a detailed analysis showing significant gaps between baseline and human performance.
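For illustration, a GTE instance pairs two premises with an entailment that the model must generate rather than classify (the sentences below are our own hypothetical example, not drawn from \mydataset):

Premise 1: All mammals are warm-blooded.
Premise 2: Whales are mammals.
Entailment (to be generated): Whales are warm-blooded.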