Super Floating-Point (SuFP): Efficient To All. Multi-Region Piecewise Quantization using Scalable Bias with Hardware Optimization

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Post-Training Quantization, Piecewise Quantization, Block Floating-Point Quantization, Hardware-Friendly Data Type
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: As Deep Neural Networks (DNNs) revolutionize various application domains, their model size and computational demand grow exponentially. Quantization techniques have emerged as highly effective responses to these challenges. However, quantization methods based on conventional data types, such as integer or floating-point, struggle to balance accuracy loss against computational benefit. With the advent of hardware accelerators for AI processing, quantization research has entered a new phase in which custom data types and specialized hardware serve as innovative alternatives. In particular, piecewise quantization and block floating-point quantization deliver notable gains in performance and efficiency, but they still struggle to handle outliers with large dynamic ranges. To address this issue, we introduce Super Floating-Point (SuFP), a breakthrough data type and quantization method that improves both memory footprint and logic efficiency without compromising model accuracy. The key idea of SuFP is multi-region piecewise quantization with a tensor-wise scalable bias, which configures an optimized precision for each region to capture both dense near-zero data and outliers. The scalable bias adapts flexibly to diverse data distributions while requiring only a single addition operation at the tensor level. Furthermore, the hardware tailored to SuFP employs only integer arithmetic units and shifters, enabling a highly compact realization. Our experimental results show that SuFP quantization achieves accuracy on par with, and in some cases exceeding, full precision floating-point (FP32) across vision, language, and generative model benchmarks. Its computational capability and energy efficiency improve upon FP32 implementations by 9.00$\times$ and 17.04$\times$, and surpass the state-of-the-art MSFP and BSFP by up to 7.20$\times$ and up to 2.06$\times$, respectively.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4943
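
To make the abstract's key idea concrete, below is a minimal, illustrative NumPy sketch of multi-region piecewise quantization with a tensor-wise power-of-two bias. The function name `sufp_like_quantize`, the two-region split, the bit widths, and the `split_frac` boundary are assumptions introduced here for illustration; they are not the SuFP format or region configuration defined in the paper.

```python
import numpy as np

def sufp_like_quantize(x, inner_bits=6, outer_bits=3, split_frac=0.125):
    """Illustrative multi-region piecewise quantizer with a tensor-wise bias.

    NOTE: this is NOT the SuFP format from the paper. The two-region split,
    bit widths, and split point are hypothetical choices used only to show
    (1) a single power-of-two bias shared by the whole tensor, and
    (2) per-region step sizes: fine steps for dense near-zero values,
        coarse steps for the outlier region.
    """
    x = np.asarray(x, dtype=np.float32)

    # Tensor-wise scalable bias: one shared power-of-two scale, so adapting
    # to the tensor's dynamic range costs a single exponent addition.
    max_abs = float(np.max(np.abs(x))) + 1e-12
    bias = np.ceil(np.log2(max_abs))        # shared exponent bias
    scale = 2.0 ** bias
    xn = x / scale                          # normalized into [-1, 1]

    # Region boundary: |xn| < split_frac is the dense near-zero region,
    # everything else is treated as the outlier region.
    inner = np.abs(xn) < split_frac
    inner_step = split_frac / (2 ** (inner_bits - 1))
    outer_step = (1.0 - split_frac) / (2 ** (outer_bits - 1))

    # Piecewise rounding: fine grid inside, coarse offset grid outside.
    q_inner = np.round(xn / inner_step) * inner_step
    q_outer = np.sign(xn) * (split_frac +
              np.round((np.abs(xn) - split_frac) / outer_step) * outer_step)
    q = np.where(inner, q_inner, q_outer)

    return q * scale                        # re-apply the shared bias

# Usage: a tensor that is mostly near zero with a few large outliers.
w = np.concatenate([np.random.randn(1000) * 0.01, np.array([2.5, -3.1])])
w_q = sufp_like_quantize(w)
print("max abs error:", np.max(np.abs(w - w_q)))
```

In a hardware-oriented variant one would also restrict the per-region step sizes to powers of two, so that both the shared bias and the region dequantization reduce to shifts and integer additions, consistent with the integer-arithmetic-and-shifter datapath the abstract describes.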