Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations

14 Oct 2021 · OpenReview Archive Direct Upload
Abstract: To increase trust in artificial intelligence systems, a promising research direction consists of designing neural models capable of generating natural language explanations for their predictions. In this work, we show that such models are nonetheless prone to generating mutually inconsistent explanations, such as “Because there is a dog in the image.” and “Because there is no dog in the [same] image.”, exposing flaws in either the decision-making process of the model or in the generation of the explanations. We introduce a simple yet effective adversarial framework for sanity checking models against the generation of inconsistent natural language explanations. Moreover, as part of the framework, we address the problem of adversarial attacks with full target sequences, a scenario that was not previously addressed in sequence-to-sequence attacks. Finally, we apply our framework on a state-of-the-art neural natural language inference model that provides natural language explanations for its predictions. Our framework shows that this model is capable of generating a significant number of inconsistent explanations.
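To illustrate the kind of sanity check the abstract describes, the sketch below shows one way an inconsistency search could be organized: generate an explanation for an input, form a contradictory target explanation, and search over perturbed inputs for one that makes the model produce that target. The `model.explain(premise, hypothesis)` interface, the toy `negate` rule, and the candidate-hypothesis search are all assumptions for illustration, not the paper's actual method or code.

```python
def negate(explanation: str) -> str:
    """Toy rule for building a candidate contradictory explanation
    (e.g. 'there is a dog' <-> 'there is no dog'). Hypothetical helper."""
    if " is no " in explanation:
        return explanation.replace(" is no ", " is a ", 1)
    return explanation.replace(" is a ", " is no ", 1)


def find_inconsistencies(model, premise, hypothesis, candidate_hypotheses):
    """Search perturbed hypotheses for one whose generated explanation
    matches the contradictory target built from the original explanation."""
    original = model.explain(premise, hypothesis)   # assumed interface
    target = negate(original)                       # full target sequence
    hits = []
    for alt in candidate_hypotheses:                # e.g. paraphrases, edits
        generated = model.explain(premise, alt)
        if generated.strip().lower() == target.strip().lower():
            hits.append((alt, generated))
    return original, target, hits
```

Any input found this way, paired with the original one, exhibits the mutually inconsistent explanations the paper targets; in this toy version the inconsistency check is exact string matching against the negated target, whereas a real framework would need a more robust notion of contradiction.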