TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer

Published: 2022 (HPCA 2022), Last Modified: 12 May 2023
Abstract: Transformer-based models are state-of-the-art for many machine learning (ML) tasks. Executing Transformers usually incurs long execution times due to their large memory footprint and low data reuse rate, which stress the memory system while under-utilizing the computing resources. Memory-based processing technologies, including processing in-memory (PIM) and near-memory computing (NMC), are promising for accelerating Transformers since they provide high memory bandwidth utilization and extensive computation parallelism. However, previous memory-based ML accelerators mainly target optimizing dataflow and hardware for compute-intensive ML models (e.g., CNNs), which does not fit the memory-intensive characteristics of Transformers. In this work, we propose TransPIM, a memory-based acceleration for Transformers using software-hardware co-design. At the software level, TransPIM adopts a token-based dataflow to avoid the expensive inter-layer data movements introduced by the layer-based dataflow of prior designs. At the hardware level, TransPIM introduces lightweight modifications to the conventional high bandwidth memory (HBM) architecture to support hybrid PIM-NMC processing and efficient data communication for accelerating Transformer-based models. Our experiments show that TransPIM is 3.7× to 9.1× faster than existing memory-based accelerators. Compared with conventional accelerators, TransPIM is 22.1× to 114.9× faster than GPUs and provides 2.0× higher throughput than existing ASIC-based accelerators.
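The following is a minimal, illustrative Python sketch of the loop-ordering contrast the abstract refers to: a layer-major schedule applies each layer to all tokens before moving on (so the full intermediate tensor moves between layers), whereas a token-major schedule runs all layers on one token's data before touching the next, keeping that token's activations local. The sizes (NUM_TOKENS, DIM, NUM_LAYERS) and the pure feed-forward layers are assumptions for illustration only; the paper's actual dataflow also covers attention (which couples tokens) and the HBM hardware mapping, neither of which is modeled here.

```python
import numpy as np

# Toy sizes; purely illustrative, not taken from the paper.
NUM_TOKENS, DIM, NUM_LAYERS = 8, 16, 4
rng = np.random.default_rng(0)
weights = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_LAYERS)]
tokens = rng.standard_normal((NUM_TOKENS, DIM))

def layer_based(x):
    # Layer-major order: every layer touches the full activation matrix,
    # so the whole intermediate tensor is moved between layers (and, on
    # real hardware, between memory regions) at each step.
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def token_based(x):
    # Token-major order: each token's activations stay "local" while all
    # layers are applied to them, mimicking the idea of keeping a token's
    # data resident in one memory partition to avoid inter-layer movement.
    out = np.empty_like(x)
    for t in range(NUM_TOKENS):
        v = x[t]
        for w in weights:
            v = np.maximum(v @ w, 0.0)
        out[t] = v
    return out

# For these per-token feed-forward layers the two schedules are
# numerically equivalent; only the data-movement pattern differs.
assert np.allclose(layer_based(tokens), token_based(tokens))
```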