Abstract: Neural networks (NNs) have been adopted in a wide range of application domains, such as image classification, speech recognition, object detection, and computer vision. However, training NNs - especially deep neural networks (DNNs) - can be energy and time consuming, because of frequent data movement between processor and memory. Furthermore, training involves massive fine-grained operations with various computation and memory access characteristics. Exploiting high parallelism with such diverse operations is challenging. To address these challenges, we propose a software/hardware co-design of heterogeneous processing-in-memory (PIM) system. Our hardware design incorporates hundreds of fix-function arithmetic units and ARM-based programmable cores on the logic layer of a 3D die-stacked memory to form a heterogeneous PIM architecture attached to CPU. Our software design offers a programming model and a runtime system that program, offload, and schedule various NN training operations across compute resources provided by CPU and heterogeneous PIM. By extending the OpenCL programming model and employing a hardware heterogeneity-aware runtime system, we enable high program portability and easy program maintenance across various heterogeneous hardware, optimize system energy efficiency, and improve hardware utilization.
0 Replies
Loading