Does Entity Abstraction Help Generative Transformers Reason?

Published: 20 Nov 2022, Last Modified: 28 Feb 2023. Accepted by TMLR.
Abstract: We study the utility of incorporating entity type abstractions into pre-trained Transformers and test these methods on four NLP tasks requiring different forms of logical reasoning: (1) compositional language understanding with text-based relational reasoning (CLUTRR), (2) abductive reasoning (ProofWriter), (3) multi-hop question answering (HotpotQA), and (4) conversational question answering (CoQA). We propose and empirically explore three ways to add such abstraction: (i) as additional input embeddings, (ii) as a separate sequence to encode, and (iii) as an auxiliary prediction task for the model. Overall, our analysis demonstrates that models with abstract entity knowledge perform better than those without it. The best abstraction-aware models achieved overall accuracies of 88.8% and 91.8%, compared to baseline accuracies of 62.9% and 89.8%, on CLUTRR and ProofWriter respectively. However, for HotpotQA and CoQA, we find that F1 scores improve by only 0.5% on average. Our results suggest that the benefit of explicit abstraction is significant in formally defined logical reasoning settings requiring many reasoning hops, but that it is less beneficial for NLP tasks with less formal logical structure.
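To make option (i) concrete, here is a minimal sketch of adding an entity-type embedding to each token embedding before the sequence reaches a Transformer. Everything in it is an assumption for illustration: the toy vocabularies, the "PERSON"/"O" tag set, the random embedding tables, and the choice to sum the two vectors (concatenation would be another option). It is not taken from the paper's implementation.

```python
import random

# Hypothetical toy vocabularies; the abstract does not specify the real tag set.
TOKEN_VOCAB = {"<unk>": 0, "Alice": 1, "is": 2, "the": 3, "mother": 4, "of": 5, "Bob": 6}
TYPE_VOCAB = {"O": 0, "PERSON": 1}  # "O" marks a token that is not an entity

DIM = 4  # embedding size, kept tiny for readability


def make_table(num_rows, dim, seed):
    """Build a randomly initialised embedding table (one row per vocabulary id)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


token_table = make_table(len(TOKEN_VOCAB), DIM, seed=0)
type_table = make_table(len(TYPE_VOCAB), DIM, seed=1)


def embed(tokens, types):
    """Return one vector per token: token embedding plus entity-type embedding.

    Summation is an assumption made for this sketch.
    """
    if len(tokens) != len(types):
        raise ValueError("tokens and types must be aligned one-to-one")
    vectors = []
    for tok, typ in zip(tokens, types):
        tok_vec = token_table[TOKEN_VOCAB.get(tok, TOKEN_VOCAB["<unk>"])]
        typ_vec = type_table[TYPE_VOCAB[typ]]
        vectors.append([a + b for a, b in zip(tok_vec, typ_vec)])
    return vectors


tokens = ["Alice", "is", "the", "mother", "of", "Bob"]
types = ["PERSON", "O", "O", "O", "O", "PERSON"]
inputs = embed(tokens, types)
```

In this sketch, "Alice" and "Bob" share the same type vector, so the model input carries a signal that both are entities of the same kind even though their token embeddings differ.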
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: List of modifications since submitted version:
- clarification of the strategy we use in Section 1
- reformulation of the loss equation in Section 3.3
- clarification of the third paragraph in Section 3
- explanation of why the best models in Table 2 & Table 3 are not the same in Sections 4.1 & 4.2
- additional relevant prior work in Section 2
- prediction with multiple random seeds to include mean & standard deviation in Table 3 & Table 4
- new Section (5 Discussion) in which we added additional statistics for all our datasets, such as:
  - the average proportion of tokens being tagged as entities
  - the performance of the entity tagger (precision and recall)
  - CLUTRR results with weaker entity tagger (10% noise)
- additional experiments on the GeoQuery dataset in the appendix
- error analysis of predictions on CLUTRR test lvl.3 in the appendix
- performance of models on CLUTRR & ProofWriter with weaker tagging accuracy (25%, 50%, 75% noise)
- contextualization of results with previous work in the last paragraphs of Sections 4.3 & 4.4
- error analysis of predictions that are the same as / different from the baseline for HotpotQA in Appendix F2
Assigned Action Editor: ~Angeliki_Lazaridou2
Submission Number: 166