A 28nm 4.35TOPS/mm2 Transformer Accelerator with Basis-vector Based Ultra Storage Compression, Decomposed Computation and Unified LUT-Assisted Cores

Published: 01 Jan 2024 · Last Modified: 16 May 2025 · VLSI Technology and Circuits 2024 · CC BY-SA 4.0
Abstract: An area-efficient Transformer accelerator exploiting matrix redundancy is presented, with four features: 1) a basis-vector decomposition that reduces model storage by 25.5x for Transformers such as BERT-Base, enabling full on-chip inference on devices with about 13 MB of memory (e.g., smartphones) at only 1.28% accuracy loss; 2) an area-efficient, self-programming LUT-assisted computing cell with result prefetch; 3) a unified, task-insensitive core supporting fast decomposed computing, yielding a 73% energy saving; 4) an NoC design facilitating hybrid data reuse to reduce communication. The accelerator achieves 4.35 $\text{TOPS}/\text{mm}^{2}$ dense area efficiency, 4x higher than the state-of-the-art counterpart at the same fabrication level, and demonstrates 213%-429% higher overall energy efficiency.
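The storage saving from a basis-vector decomposition can be sketched as follows. This is a minimal illustration, not the paper's actual method: it stands in for the general idea by approximating a weight matrix W (d_out x d_in) as a product of a small basis matrix B (d_out x k) and a coefficient matrix C (k x d_in), here via truncated SVD; the function names and the choice of k are assumptions for illustration only.

```python
import numpy as np

def basis_decompose(W, k):
    """Approximate W ~= B @ C using the top-k singular directions.

    B holds k basis vectors (scaled by singular values); C holds the
    per-column coefficients. This truncated-SVD factorization is a
    generic stand-in for basis-vector style weight compression.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :k] * s[:k]   # (d_out, k) basis vectors
    C = Vt[:k, :]          # (k, d_in) coefficients
    return B, C

def compression_ratio(shape, k):
    """Dense parameter count divided by decomposed parameter count."""
    d_out, d_in = shape
    return (d_out * d_in) / (k * (d_out + d_in))

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))   # BERT-Base hidden size
B, C = basis_decompose(W, k=16)
print(compression_ratio(W.shape, 16))  # 24.0x fewer stored parameters
```

For a 768x768 layer, storing B and C with k = 16 needs only 16 x (768 + 768) values instead of 768 x 768, a 24x reduction; the quality of the approximation (and hence the accuracy loss) depends on how well k basis vectors capture the redundancy in the weights.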