Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations
Abstract: To increase trust in artificial intelligence systems, a promising research direction consists of designing neural models capable of generating natural language explanations for their predictions. In this work, we show that such models are nonetheless prone to generating mutually inconsistent explanations, such as “Because there is a dog in the image.” and “Because there is no dog in the [same] image.”, exposing flaws in either the decision-making process of the model or in the generation of the explanations. We introduce a simple yet effective adversarial framework for sanity-checking models against the generation of inconsistent natural language explanations. Moreover, as part of the framework, we address the problem of adversarial attacks with full target sequences, a scenario that was not previously addressed in sequence-to-sequence attacks. Finally, we apply our framework to a state-of-the-art neural natural language inference model that provides natural language explanations for its predictions. Our framework shows that this model is capable of generating a significant number of inconsistent explanations.
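The abstract only sketches the approach at a high level, but the core sanity check it describes can be illustrated with a rough, simplified sketch: reverse a generated explanation into a contradictory target (e.g. by toggling a negation) and flag any input for which the model produces that contradictory explanation. All identifiers below (explain, reverse_explanation, find_inconsistencies) and the rule-based reversal are hypothetical placeholders chosen for illustration, not the paper's actual method or API.

    # Rough, simplified illustration of a consistency sanity check for models
    # that generate natural language explanations. All identifiers here are
    # hypothetical placeholders, not the paper's implementation.

    # Toy reversal rules in the spirit of the abstract's dog/no-dog example.
    REVERSAL_RULES = [
        ("there is no ", "there is a "),
        ("there is a ", "there is no "),
    ]

    def reverse_explanation(explanation):
        """Return a statement contradicting the explanation, or "" if no
        reversal rule applies (rule-based toggling of a negation)."""
        for pattern, replacement in REVERSAL_RULES:
            if pattern in explanation:
                return explanation.replace(pattern, replacement, 1)
        return ""

    def find_inconsistencies(explain, candidate_inputs, original_input):
        """Flag candidate inputs whose generated explanation matches the
        contradictory target built from the original input's explanation."""
        target = reverse_explanation(explain(original_input))
        inconsistent = []
        for x in candidate_inputs:
            explanation = explain(x)
            if target and explanation.strip() == target.strip():
                inconsistent.append((x, explanation))
        return inconsistent

    if __name__ == "__main__":
        # Stand-in for a real explanation-generating model.
        toy_explanations = {
            "image_1": "Because there is a dog in the image.",
            "image_1_perturbed": "Because there is no dog in the image.",
        }
        explain = lambda x: toy_explanations.get(x, "")
        print(find_inconsistencies(explain, ["image_1_perturbed"], "image_1"))
        # -> [('image_1_perturbed', 'Because there is no dog in the image.')]

The abstract mentions adversarial attacks with full target sequences; the sketch above sidesteps that attack entirely and simply scans a given list of candidate inputs for ones whose explanation matches the contradictory target.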