- Abstract: Training activation-quantized neural networks involves minimizing a piecewise constant training loss whose gradient vanishes almost everywhere, which makes standard back-propagation via the chain rule ineffective. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the training loss function, the following question arises naturally: why does searching in its negative direction minimize the training loss? In this paper, we provide a theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is evidently not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the underlying true gradient (which is not available during training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a local minimum (more rigorously, a critical point) of the population loss minimization problem. Moreover, we show that a relatively poor choice of STE may lead to instability of the training algorithm near certain local minima, which is also observed in our CIFAR-10 experiments.
- Keywords: straight-through estimator, quantized activation, binary neuron
- TL;DR: We provide the first theoretical justification for the concept of the straight-through estimator.
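
To make the notion of a coarse gradient concrete, here is a minimal NumPy sketch of the setting described in the abstract: a two-linear-layer network with binarized ReLU activation trained on Gaussian inputs. Several details are assumptions made for illustration and are not taken from the text above: the model form `v . binary_act(Z w)`, the squared loss, the teacher parameters `w_star` and `v_star` that generate the labels, the use of the ReLU derivative as the STE, and all function names and hyperparameters.

```python
import numpy as np


def binary_act(x):
    """Binarized ReLU: 1 where x > 0, else 0 (piecewise constant)."""
    return (x > 0).astype(np.float64)


def relu_ste(x):
    """Derivative of ReLU, substituted in the backward pass for the
    almost-everywhere-zero derivative of binary_act (one possible STE)."""
    return (x > 0).astype(np.float64)


def forward(Z, w, v):
    """Z: (B, m, n) inputs, w: (n,) first layer, v: (m,) second layer."""
    return binary_act(Z @ w) @ v  # shape (B,)


def loss(Z, w, v, y):
    """Mean squared loss over the batch (an assumed loss, for illustration)."""
    return 0.5 * np.mean((forward(Z, w, v) - y) ** 2)


def coarse_grad(Z, w, v, y):
    """Return (coarse gradient w.r.t. w, ordinary gradient w.r.t. v)."""
    pre = Z @ w                # pre-activations, shape (B, m)
    act = binary_act(pre)      # quantized activations, shape (B, m)
    residual = act @ v - y     # prediction error, shape (B,)
    # The loss is smooth in v, so this is an ordinary gradient.
    grad_v = (act * residual[:, None]).mean(axis=0)
    # Chain rule with relu_ste standing in for the derivative of binary_act.
    grad_w = np.einsum("bmn,bm,b->n", Z, relu_ste(pre) * v, residual) / len(y)
    return grad_w, grad_v


def run_demo(steps=200, batch=256, m=4, n=8, lr=0.05, seed=0):
    """Coarse gradient descent against a hypothetical teacher network."""
    rng = np.random.default_rng(seed)
    w_star, v_star = rng.standard_normal(n), rng.standard_normal(m)
    w, v = rng.standard_normal(n), rng.standard_normal(m)
    history = []
    for _ in range(steps):
        Z = rng.standard_normal((batch, m, n))  # fresh Gaussian inputs
        y = forward(Z, w_star, v_star)          # labels from the teacher
        history.append(loss(Z, w, v, y))
        grad_w, grad_v = coarse_grad(Z, w, v, y)
        w -= lr * grad_w
        v -= lr * grad_v
    return history


if __name__ == "__main__":
    h = run_demo()
    print(f"loss at first step: {h[0]:.4f}, at last step: {h[-1]:.4f}")
```

In this sketch, `loss(Z, 2.0 * w, v, y)` equals `loss(Z, w, v, y)` because rescaling `w` does not change which pre-activations are positive. This is a small instance of the piecewise constant behavior in the first-layer weights that makes the true gradient vanish almost everywhere, and it is why `coarse_grad` substitutes `relu_ste` in the backward pass. Swapping `relu_ste` for another surrogate derivative is a simple way to experiment with different STE choices.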