Abstract: Many regularization methods have been proposed to prevent overfitting in neural networks. Recently, a regularization method was proposed that optimizes the variational lower bound of the Information Bottleneck Lagrangian. However, that method does not generalize to standard neural network architectures. We present the activation norm penalty, which is derived from the information bottleneck principle and is theoretically grounded in a variational dropout framework. Unlike previous work, it can be applied to any general neural network. We demonstrate that this penalty yields consistent improvements across different state-of-the-art architectures in both language modeling and image classification. We analyze the properties of this penalty and compare it to other methods that also reduce mutual information.
TL;DR: We derive a norm penalty on the output of the neural network from the information bottleneck perspective
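The TL;DR describes a norm penalty on the network's output activations added to the training objective. The sketch below shows one plausible way such a penalty could be wired into a training step: an L2 penalty on the final hidden activations, scaled by a coefficient and added to the task loss. The model, the layer chosen for the penalty, and the coefficient `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions: PyTorch, penalty = alpha * mean squared L2 norm
# of the final hidden activations; the exact form used in the paper may differ).
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, in_dim=128, hidden=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.encoder(x)          # activations to be penalized
        return self.head(h), h

model = TinyClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha = 1e-3                          # assumed penalty strength

x = torch.randn(32, 128)              # dummy input batch
y = torch.randint(0, 10, (32,))       # dummy labels

logits, h = model(x)
# Task loss plus the activation norm penalty on the hidden representation.
loss = criterion(logits, y) + alpha * h.pow(2).sum(dim=1).mean()
loss.backward()
optimizer.step()
```

In practice the penalty strength would be tuned per architecture and dataset, like any other regularization coefficient.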
Keywords: Deep Learning, Natural Language Processing
Data: [WikiText-2](https://paperswithcode.com/dataset/wikitext-2)