Abstract: Deep learning trends to use low precision numeral formats to cope with the ever-growing model sizes. For example, the large language model LLaMA2 has been widely deployed in 4-bit precision. With larger models and fewer unique values caused by low precision, an increasing proportion of arithmetic in matrix multiplication is repeating. Although discussed in prior works, such value redundancy has not been fully exploited, and the cost to leverage the value redundancy often offsets any advantages. In this paper, we propose to primitivize the matrix multiplication, that is decomposing it down to the 1-ary successor function (a.k.a. counting) to merge repeating arithmetic. We revisited various techniques to propose Cambricon-C SA, a 4-bit primitive matrix multiplication unit that doubles the energy efficiency over conventional systolic arrays. Experimental results show that Cambricon-C SA can achieve $\mathbf{1}.\mathbf{95}\times$ energy efficiency improvement compared with MAC-based systolic array.
Loading