Do Current Natural Language Inference Models Truly Understand Sentences? Insights from Simple Sentences

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Natural language inference (NLI) is the task of inferring the relationship between a premise and a hypothesis (e.g., entailment, neutral, or contradiction), and transformer-based models perform well on current NLI datasets such as MNLI and SNLI. Nevertheless, given the complexity of the task, and especially the complexity of the sentences used for model evaluation, it remains controversial whether these models truly infer the meaning of sentences or simply guess the answer via non-humanlike heuristics. Here, we reduce the complexity of the task using two approaches. The first approach simplifies the relationship between the premise and the hypothesis by making them unrelated. A test set, referred to as Random Pair, is constructed by randomly pairing premises and hypotheses in MNLI/SNLI. Models fine-tuned on MNLI/SNLI identify a large proportion (up to 77.6%) of these unrelated statements as contradictory. Models fine-tuned on SICK, a dataset that includes unrelated premise-hypothesis pairs, perform well on Random Pair. The second approach simplifies the task by constraining the premises/hypotheses to be syntactically/semantically simple sentences. A new test set, referred to as Simple Pair, is constructed from simple sentences, such as short SVO sentences and basic conjunction sentences. We find that models fine-tuned on MNLI/SNLI generally fail to understand these simple sentences, but their performance can be boosted by re-fine-tuning the models on only a few hundred samples from SICK. All models tested here, however, fail to understand the fundamental compositional binding relation between a subject and a predicate (up to ~100% error rate) for basic conjunction sentences. Taken together, the results show that models achieving high accuracy on mainstream datasets can still lack basic sentence comprehension capacity, and that datasets discouraging non-humanlike heuristics are required to build more robust NLI models.
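
For illustration, a minimal sketch of the Random Pair idea described in the abstract is given below: premises from an existing NLI dataset are re-paired with hypotheses drawn from other examples so that the two statements are (with high probability) unrelated. The function name, the `examples` structure (a list of premise-hypothesis pairs), and the sampling details are assumptions for this sketch, not the authors' actual construction procedure.

```python
import random

def make_random_pairs(examples, seed=0):
    """Sketch: re-pair each premise with a hypothesis from a different example.

    `examples` is assumed to be a list of (premise, hypothesis) string pairs,
    e.g. drawn from MNLI or SNLI. The resulting pairs are unrelated statements,
    for which the expected NLI label would be neutral rather than contradiction.
    """
    rng = random.Random(seed)
    pairs = []
    for i, (premise, _) in enumerate(examples):
        # Pick a hypothesis from a different example so a premise is never
        # re-paired with its own hypothesis.
        j = rng.randrange(len(examples))
        while j == i:
            j = rng.randrange(len(examples))
        pairs.append((premise, examples[j][1]))
    return pairs
```

A model that truly reads both sentences should treat such pairs as neutral; the abstract reports that MNLI/SNLI-tuned models instead label a large fraction of them contradictory.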