Cambricon-C: Efficient 4-Bit Matrix Unit via Primitivization

Yi Chen; Yongwei Zhao; Yifan Hao; Yuanbo Wen; Yuntao Dai; Xiaqing Li; Yang Liu; Rui Zhang; Mo Zou; Xinkai Song; Xing Hu; Zidong Du; Huaping Chen; Qi Guo; Tianshi Chen

Published: 01 Jan 2024, Last Modified: 17 Apr 2025. Venue: MICRO 2024. License: CC BY-SA 4.0

Abstract: Deep learning increasingly uses low-precision numeric formats to cope with ever-growing model sizes. For example, the large language model LLaMA2 has been widely deployed in 4-bit precision. With larger models and fewer unique values caused by low precision, an increasing proportion of the arithmetic in matrix multiplication is repeated. Although discussed in prior work, such value redundancy has not been fully exploited, and the cost of leveraging it often offsets any advantage. In this paper, we propose to primitivize matrix multiplication, that is, to decompose it down to the 1-ary successor function (a.k.a. counting) in order to merge repeated arithmetic. We revisit various techniques to propose Cambricon-C SA, a 4-bit primitive matrix multiplication unit that roughly doubles the energy efficiency of conventional systolic arrays. Experimental results show that Cambricon-C SA achieves a $1.95\times$ energy-efficiency improvement over a MAC-based systolic array.
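The idea of merging repeated arithmetic through counting can be illustrated in software. The sketch below is not the Cambricon-C SA hardware design, which the abstract does not detail; it only assumes signed 4-bit operands in the range -8 to 7, so a dot product of any length contains at most 256 distinct operand pairs. The function names `dot_mac` and `dot_by_counting` are hypothetical and chosen for illustration.

```python
import random


def dot_mac(a, b):
    """Conventional dot product: one multiply-accumulate per element."""
    return sum(x * y for x, y in zip(a, b))


def dot_by_counting(a, b):
    """Dot product that first counts how often each operand pair occurs.

    The per-element work is a single increment (the successor function).
    Multiplication happens only once per distinct pair, of which there
    are at most 16 * 16 = 256 for 4-bit operands.
    """
    counts = {}
    for pair in zip(a, b):
        counts[pair] = counts.get(pair, 0) + 1  # counting step
    # Merge step: each distinct product is computed once and weighted
    # by the number of times its operand pair appeared.
    return sum(n * x * y for (x, y), n in counts.items())


# Illustrative data (assumption): random signed 4-bit values.
random.seed(0)
a = [random.randrange(-8, 8) for _ in range(4096)]
b = [random.randrange(-8, 8) for _ in range(4096)]

assert dot_by_counting(a, b) == dot_mac(a, b)
```

With 4096 elements, the conventional version performs 4096 multiplications, while the counting version performs 4096 increments followed by at most 256 multiplications. The longer the vectors, the larger the share of work that reduces to counting.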
