Abstract: In this paper, we propose a novel approach to interpret a well-trained classification model through systematically investigating effects of its hidden units on prediction making. We search for the core hidden units responsible for predicting inputs as the class of interest under the generative Bayesian inference framework. We model such a process of unit selection as an Indian Buffet Process, and derive a simplified objective function via the MAP asymptotic technique. The induced binary optimization problem is efficiently solved with a continuous relaxation method by attaching a Switch Gate layer to the hidden layers of interest. The resulted interpreter model is thus end-to-end optimized via standard gradient back-propagation. Experiments are conducted with two popular deep convolutional classifiers, respectively well-trained on the MNIST dataset and the CI- FAR10 dataset. The results demonstrate that the proposed interpreter successfully finds the core hidden units most responsible for prediction making. The modified model, only with the selected units activated, can hold correct predictions at a high rate. Besides, this interpreter model is also able to extract the most informative pixels in the images by connecting a Switch Gate layer to the input layer.
4 Replies
Loading