Where is the Model Looking At? - Concentrate and Explain the Network Attention

IEEE J. Sel. Top. Signal Process., 2020
Abstract: Image classification models have achieved satisfactory performance on many datasets, sometimes even surpassing humans. However, where the model attends is often unclear due to the lack of interpretability. This paper investigates the fidelity and interpretability of model attention. We propose an Explainable Attribute-based Multi-task (EAT) framework that concentrates the model attention on the discriminative image area and makes the attention interpretable. We introduce attribute prediction into the multi-task learning network, helping the network concentrate its attention on the foreground objects. We generate attribute-based textual explanations for the network and ground the attributes on the image to provide visual explanations. The multi-modal explanations not only improve user trust but also help reveal weaknesses of the network and the dataset. Our framework can be generalized to any base model. We perform experiments on three datasets and five base models. Results indicate that the EAT framework can give multi-modal explanations that interpret the network decision, and that guiding the network attention improves the performance of several recognition approaches.
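
The abstract describes adding attribute prediction as an auxiliary task alongside classification on a shared backbone. The paper's own implementation is not shown on this page; the snippet below is only a minimal sketch of that multi-task idea, assuming a PyTorch ResNet-50 backbone. The class name, the attribute/class counts, and the 0.5 loss weight are illustrative placeholders, not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import models


class MultiTaskSketch(nn.Module):
    """Sketch of a multi-task net: a shared backbone feeds a class head and an
    attribute head, so attribute supervision can pull attention toward the
    foreground object (the core idea behind the EAT framework)."""

    def __init__(self, num_classes: int, num_attributes: int):
        super().__init__()
        backbone = models.resnet50(weights=None)   # any base model could be used
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # keep pooled features only
        self.backbone = backbone
        self.class_head = nn.Linear(feat_dim, num_classes)
        self.attr_head = nn.Linear(feat_dim, num_attributes)

    def forward(self, x):
        feats = self.backbone(x)
        return self.class_head(feats), self.attr_head(feats)


# Joint objective: classification loss + attribute-prediction loss
# (the 0.5 weighting is an assumption for illustration).
model = MultiTaskSketch(num_classes=200, num_attributes=312)
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 200, (4,))
attrs = torch.randint(0, 2, (4, 312)).float()      # binary attribute annotations
class_logits, attr_logits = model(images)
loss = nn.functional.cross_entropy(class_logits, labels) \
     + 0.5 * nn.functional.binary_cross_entropy_with_logits(attr_logits, attrs)
loss.backward()
```

In this sketch, the attribute head shares all backbone features with the classifier, which is one simple way the auxiliary attribute supervision can shape where the shared representation (and hence the attention) focuses; attribute logits could also be grounded back onto image regions to produce the visual explanations the abstract mentions.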