Keywords: Quantization, Hardware Acceleration, Deep Learning
Abstract: We introduce a software-hardware co-design approach that reduces memory traffic and footprint during training with BFloat16 or FP32, boosting energy efficiency and execution-time performance. Our methods dynamically adjust the size and format of the floating-point containers used to store activations and weights during training. The different value distributions of exponents and mantissas lead us to different approaches for each. Gecko exploits the favourable exponent distribution with a lossless delta-encoding approach that reduces the total exponent footprint by up to 58% compared with the FP32 baseline. To contend with the noisy mantissa distributions, we present two lossy methods that eliminate as many least-significant bits as possible without affecting accuracy. Quantum Mantissa is a machine-learning mantissa compression method that taps into the gradient-descent algorithm to learn minimal mantissa bitlengths at per-layer granularity, obtaining up to a 92% reduction in total mantissa footprint. Alternatively, BitChop observes changes in the loss function during training to adjust the mantissa bitlength network-wide, yielding an 81% reduction in footprint. Schrödinger's FP implements hardware encoders/decoders that, guided by Gecko/Quantum Mantissa or Gecko/BitChop, transparently encode/decode values when transferring them to/from off-chip memory, boosting energy efficiency and reducing execution time.
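The following is a minimal Python sketch, not the authors' implementation or the Schrödinger's FP hardware: it only illustrates, under stated assumptions, the two ideas the abstract names, lossless per-group delta encoding of FP32 exponents and lossy truncation of low-order mantissa bits. The group size, bit budgets, and function names (`exponent_delta_bits`, `truncate_mantissa`) are hypothetical choices for illustration.

```python
# Illustrative sketch only (not the paper's method): exponent delta coding
# and mantissa truncation on FP32 tensors, with hypothetical parameters.
import numpy as np


def exponent_delta_bits(values: np.ndarray, group_size: int = 64) -> float:
    """Estimate average bits per exponent if each group stores one 8-bit
    base exponent plus fixed-width non-negative deltas (lossless idea)."""
    bits = values.astype(np.float32).view(np.uint32)
    exponents = ((bits >> 23) & 0xFF).astype(np.int32)  # 8-bit FP32 exponent field
    total_bits = 0
    for start in range(0, exponents.size, group_size):
        group = exponents[start:start + group_size]
        deltas = group - group.min()                     # deltas from group minimum
        width = int(deltas.max()).bit_length() or 1      # bits needed per delta
        total_bits += 8 + width * group.size             # base exponent + deltas
    return total_bits / exponents.size


def truncate_mantissa(values: np.ndarray, keep_bits: int) -> np.ndarray:
    """Lossy step: zero out the (23 - keep_bits) least significant mantissa
    bits of FP32 values, keeping sign, exponent, and top mantissa bits."""
    bits = values.astype(np.float32).view(np.uint32)
    mask = ~np.uint32((1 << (23 - keep_bits)) - 1)
    return (bits & mask).view(np.float32)


if __name__ == "__main__":
    acts = np.random.randn(1 << 16).astype(np.float32)   # stand-in activations
    print(f"avg exponent bits after delta coding: {exponent_delta_bits(acts):.2f} (vs. 8)")
    approx = truncate_mantissa(acts, keep_bits=4)        # hypothetical 4-bit mantissa
    print(f"max abs error with 4 mantissa bits: {np.max(np.abs(acts - approx)):.3e}")
```

In the paper's setting the mantissa bitlength would be chosen adaptively (per layer by Quantum Mantissa, or network-wide by BitChop) rather than fixed as in this sketch, and the encode/decode would happen in hardware on the path to off-chip memory.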
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
TL;DR: Reducing the training cost and time by learning and using, on-the-fly, shorter floating-point formats