CLAY: CXL-based Scalable NDP Architecture Accelerating Embedding Layers

Sungmin Yun, Hwayong Nam, Kwanhee Kyung, Jaehyun Park, Byeongho Kim, Yongsuk Kwon, Eojin Lee, Jung Ho Ahn

Published: 2024, Last Modified: 31 Oct 2024, ICS 2024, CC BY-SA 4.0

Abstract: An embedding layer is one of the most critical building blocks of deep neural networks (DNNs), especially for recommender systems and graph neural networks. The embedding layer accounts for a large portion of the total execution time due to its large memory requirements and little data reuse in its operations. To accelerate embedding layers, dual in-line memory module (DIMM) based near-data processing (NDP) architectures have been proposed. They amplify bandwidth by adding a processing unit to the DIMM’s buffer. However, prior architectures offer limited capacity scalability because the number of memory channels is limited. Crucially, their performance improvement is limited by the load imbalance problem and by the constraints of DIMM-based memory systems, which use a multi-drop bus structure between the processing units and the host. In this paper, we propose CLAY, a CXL-based scalable near-data processing architecture that accelerates general embedding layers in DNNs. Breaking away from conventional memory channel structures, CLAY interconnects the DRAM modules to reduce the data transfer overhead among DRAM modules. Furthermore, we devise a dedicated memory address mapping to mitigate load imbalance in CLAY and a packet duplication scheme that enables full utilization of CLAY by reducing the required instruction transmission bandwidth. We propose a method of scaling CLAY and a software stack to use CLAY. Compared to the state-of-the-art NDP architectures of FeaNMP and G-NMP, CLAY achieves an end-to-end speedup of up to 1.87× and 2.77× for recommender systems and graph neural networks, respectively.
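The "little data reuse" claim can be illustrated with a minimal sketch of a pooled embedding lookup (gather rows, then sum them). This is not CLAY's implementation; the function names `embedding_bag_sum` and `arithmetic_intensity`, the sum-pooling choice, and the 4-byte element size are assumptions made only for illustration.

```python
def embedding_bag_sum(table, indices):
    """Gather the rows named by `indices` and sum them element-wise."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for idx in indices:
        row = table[idx]          # each gathered row is read once, then discarded
        for d in range(dim):
            pooled[d] += row[d]   # one addition per element read
    return pooled


def arithmetic_intensity(num_lookups, dim, bytes_per_element=4):
    """Additions per byte read for one pooled lookup (assumes no cache reuse)."""
    adds = num_lookups * dim
    bytes_read = num_lookups * dim * bytes_per_element
    return adds / bytes_read


# Hypothetical toy table: 3 rows, embedding dimension 2.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(embedding_bag_sum(table, [0, 2]))
print(arithmetic_intensity(num_lookups=80, dim=64))
```

Under these assumptions, every element fetched from memory takes part in a single addition, so the operation is bound by memory bandwidth rather than compute. That is the property NDP designs aim to exploit by placing the reduction close to the DRAM.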