Training wide residual networks for deployment using a single bit for each weightDownload PDF

15 Feb 2018 (modified: 15 Sept 2024)ICLR 2018 Conference Blind SubmissionReaders: Everyone
Abstract: For fast and energy-efficient deployment of trained deep neural networks on resource-constrained embedded hardware, each learned weight parameter should ideally be represented and stored using a single bit. Error-rates usually increase when this requirement is imposed. Here, we report large improvements in error rates on multiple datasets, for deep convolutional neural networks deployed with 1-bit-per-weight. Using wide residual networks as our main baseline, our approach simplifies existing methods that binarize weights by applying the sign function in training; we apply scaling factors for each layer with constant unlearned values equal to the layer-specific standard deviations used for initialization. For CIFAR-10, CIFAR-100 and ImageNet, and models with 1-bit-per-weight requiring less than 10 MB of parameter memory, we achieve error rates of 3.9%, 18.5% and 26.0% / 8.5% (Top-1 / Top-5) respectively. We also considered MNIST, SVHN and ImageNet32, achieving 1-bit-per-weight test results of 0.27%, 1.9%, and 41.3% / 19.1% respectively. For CIFAR, our error rates halve previously reported values, and are within about 1% of our error-rates for the same network with full-precision weights. For networks that overfit, we also show significant improvements in error rate by not learning batch normalization scale and offset parameters. This applies to both full precision and 1-bit-per-weight networks. Using a warm-restart learning-rate schedule, we found that training for 1-bit-per-weight is just as fast as full-precision networks, with better accuracy than standard schedules, and achieved about 98%-99% of peak performance in just 62 training epochs for CIFAR-10/100. For full training code and trained models in MATLAB, Keras and PyTorch see https://github.com/McDonnell-Lab/1-bit-per-weight/ .
TL;DR: We train wide residual networks that can be immediately deployed using only a single bit for each convolutional weight, with signficantly better accuracy than past methods.
Keywords: wide residual networks, model compression, quantization, 1-bit weights
Code: [![github](/images/github_icon.svg) McDonnell-Lab/1-bit-per-weight](https://github.com/McDonnell-Lab/1-bit-per-weight) + [![Papers with Code](/images/pwc_icon.svg) 3 community implementations](https://paperswithcode.com/paper/?openreview=rytNfI1AZ)
Data: [ImageNet-32](https://paperswithcode.com/dataset/imagenet-32), [MNIST](https://paperswithcode.com/dataset/mnist)
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 4 code implementations](https://www.catalyzex.com/paper/training-wide-residual-networks-for/code)
12 Replies

Loading