Abstract: Dropout is a simple yet effective technique to improve generalization performance and prevent overfitting in deep neural networks (DNNs). In this paper, we discuss three novel observations about dropout to better understand the generalization of DNNs with rectified linear unit (ReLU) activations: 1) dropout is a smoothing technique that encourages each local linear model of a DNN to be trained on data points from nearby regions; 2) a constant dropout rate can result in effective neural-deactivation rates that are significantly different for layers with different fractions of activated neurons; and 3) the rescaling factor of dropout causes an inconsistency to occur between the normalization during training and testing conditions when batch normalization is also used. The above leads to three simple but nontrivial improvements to dropout resulting in our proposed method "Jumpout." Jumpout samples the dropout rate using a monotone decreasing distribution (such as the right part of a truncated Gaussian), so the local linear model at each data point is trained, with high probability, to work better for data points from nearby than from more distant regions. Instead of tuning a dropout rate for each layer and applying it to all samples, jumpout moreover adaptively normalizes the dropout rate at each layer and every training sample/batch, so the effective dropout rate applied to the activated neurons are kept the same. Moreover, we rescale the outputs of jumpout for a better trade-off that keeps both the variance and mean of neurons more consistent between training and test phases, which mitigates the incompatibility between dropout and batch normalization. Compared to the original dropout, jumpout shows significantly improved performance on CIFAR10, CIFAR100, Fashion- MNIST, STL10, SVHN, ImageNet-1k, etc., while introducing negligible additional memory and computation costs.
Keywords: Dropout, deep neural networks with ReLU, local linear model
TL;DR: Jumpout applies three simple yet effective modifications to dropout, based on novel understandings about the generalization performance of DNN with ReLU in local regions.
Data: [CIFAR-10](https://paperswithcode.com/dataset/cifar-10), [CIFAR-100](https://paperswithcode.com/dataset/cifar-100), [Fashion-MNIST](https://paperswithcode.com/dataset/fashion-mnist), [ImageNet](https://paperswithcode.com/dataset/imagenet), [ImageNet-1K](https://paperswithcode.com/dataset/imagenet-1k-1), [STL-10](https://paperswithcode.com/dataset/stl-10), [SVHN](https://paperswithcode.com/dataset/svhn)
9 Replies
Loading