Abstract: Accurate explanations of how a trained neural network (NN) behaves are desirable for a wide range of labor-intensive activities, including troubleshooting, validation, understanding performance issues, and identifying biases. We address the problem of explaining piecewise-linear feedforward NNs (such as CNNs with ReLU) by breaking them down into their linear components. We automatically encode the NN structure as linear program constraints to extract exact explanations over continuous input regions; these explanations help interpret model behavior over those regions, provide model descriptions that convey the importance of input features, and answer model queries that analyze feature contributions under user-defined constraints. Our examples show that we can extract explanations for NNs with a moderate number of layers without relying on approximations. We demonstrate that the high-level explanations can help understand the outputs of NNs and the comparative importance of features by inspecting the linear models. We also show how explaining continuous inputs prevents certain attacks that have been proposed against existing explainers.
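The decomposition rests on a standard fact: a ReLU network is affine on each region of input space that shares a fixed activation pattern. As a minimal illustration of that linear component (not the paper's linear-program encoding, which the abstract only summarizes), the hypothetical NumPy sketch below extracts the affine map (A, c) that a small ReLU network computes on the region containing a given input x:

```python
import numpy as np

def local_linear_model(weights, biases, x):
    """Return (A, c) such that the ReLU network computes A @ x' + c
    for every x' sharing x's activation pattern (where it is affine).
    `weights`/`biases` are per-layer parameter lists for a toy network;
    the final layer is taken to be affine (no ReLU)."""
    A = np.eye(x.shape[0])           # running linear part, starts as identity
    c = np.zeros(x.shape[0])         # running offset
    z = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        A, c = W @ A, W @ c + b      # compose this layer's affine map
        z = W @ z + b
        if i < len(weights) - 1:     # ReLU on hidden layers only
            mask = (z > 0).astype(float)
            A, c, z = mask[:, None] * A, mask * c, mask * z  # zero inactive units
    return A, c

# Usage with randomly initialized toy parameters (hypothetical example):
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
x = rng.standard_normal(3)
A, c = local_linear_model(weights, biases, x)
# A @ x + c reproduces the network's output at x and throughout x's region;
# the rows of A expose each input feature's exact contribution there.
```

The paper's approach, as described above, goes further by encoding the activation-pattern constraints themselves into a linear program, so explanations hold over explicitly characterized continuous regions rather than a single point's neighborhood.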