Optimization of NUMA Aware DNN Computing System

Published: 01 Jan 2024, Last Modified: 08 Nov 2025. ICIC (4) 2024. License: CC BY-SA 4.0.
Abstract: Modern high-performance computing systems typically adopt the NUMA architecture, a design that mitigates the 'memory wall' problem caused by simultaneous memory accesses from multiple independent processors. However, the performance of large computational workloads, including AI workloads, hinges on careful memory allocation strategies within this architecture. On Linux, for example, the default memory allocation policy is First Touch (FT). This policy often leads to substantial remote memory accesses and imbalanced memory allocation across nodes, degrading the performance of Deep Neural Network (DNN) computations. The core difficulty is that current operating systems cannot accurately detect an application's memory access patterns. In addition, most optimizations in existing DNN computation systems overlook NUMA-specific challenges, such as those arising from inter-layer dependencies within DNNs and dependencies between memory blocks, and therefore yield suboptimal performance gains. To address these issues, this paper proposes a NUMA-aware DNN computing system. The system standardizes the memory access pattern of every DNN layer during forward and backward propagation, replacing dynamic memory allocation with static NUMA optimization and thereby avoiding its associated overheads. Furthermore, we propose a page-aligned memory allocation strategy that prevents the non-local memory accesses that commonly result from inter-block dependencies. Compared with existing approaches, our system achieves a maximum single-layer speedup of 1.63x and an overall speedup of 1.37x on DNN computations.
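To make the contrast between default first-touch placement and explicit node-local, page-aligned allocation concrete, the following C sketch uses the standard libnuma API to bind a buffer to the NUMA node of the calling thread. It is only an illustration of the general technique the abstract refers to; the buffer size and the idea of it holding one layer's activations are hypothetical, and the paper's own allocator is not shown here.

```c
/* Minimal sketch of node-local, page-aligned allocation with libnuma
 * (compile with -lnuma). The buffer size and its role as a "layer
 * buffer" are assumptions for illustration only. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Node on which the calling (worker) thread currently runs. */
    int node = numa_node_of_cpu(sched_getcpu());
    size_t bytes = 64UL * 1024 * 1024;   /* hypothetical layer buffer */

    /* numa_alloc_onnode returns page-aligned memory whose pages are
     * placed on the given node, rather than wherever the first-touch
     * policy happens to put them. */
    float *buf = numa_alloc_onnode(bytes, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, bytes);  /* touch the pages; they remain on `node` */
    printf("allocated %zu bytes on node %d\n", bytes, node);

    numa_free(buf, bytes);
    return 0;
}
```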