Keywords: Bulk Bitwise Accumulation, Processing-in-memory, Commercial DRAM
TL;DR: This paper presents a novel method for efficient bulk bitwise accumulation on commercial DRAM, demonstrating significant throughput improvements over a GPU while maintaining high accuracy, potentially accelerating machine learning tasks.
Abstract: Processing-in-memory (PIM) is a promising paradigm for addressing data transfer bottlenecks in data-intensive workloads, particularly in machine learning. Among PIM techniques, Processing-using-Commercial-DRAM (PuCD) offers a practical approach to in-memory computing by employing widely available DRAM modules without hardware modifications. With its massive bit-level parallelism, PuCD can perform bulk bitwise logic operations at high throughput. However, implementing $\textit{accumulation}$ operations, which are crucial for machine learning tasks, remains challenging in PuCD: accumulation requires multiple consecutive operations, which increases latency and propagates errors. To address these challenges, we propose a novel method for bulk bitwise accumulation using PuCD. As its fundamental building block, we introduce a novel implementation of the $\textit{population-count-of-3}$ ($\texttt{POPCNT3}$) operation tailored for commercial DRAM. On top of this, we present a $\texttt{POPCNT3}$-based bitwise accumulation method that scales efficiently to large and varied input sizes. We evaluate the throughput and errors of our approach using commercial DDR4 DRAM modules driven by an FPGA. The experiments indicate throughput improvements of up to 348 times over an A100 GPU across various input sizes, with errors small enough to preserve the accuracy of machine learning applications. These results demonstrate that PuCD can provide a practical pathway for accelerating machine learning tasks without requiring specialized memory chips.
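The abstract does not detail how $\texttt{POPCNT3}$ is realized inside DRAM, but its logical behavior is that of a bit-parallel full adder: for each bit position across three input vectors, it counts how many inputs are 1 and encodes the count (0–3) in two output bits. A minimal software sketch of this semantics (an illustrative model only, not the paper's in-DRAM implementation; the function name `popcnt3` is assumed):

```python
def popcnt3(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise population-count-of-3 over three bit vectors packed into ints.

    At each bit position, the number of set inputs (0..3) is returned as two
    bit vectors (msb, lsb), i.e. the carry and sum outputs of a full adder.
    This models the logical behavior only, not the DRAM-level mechanism.
    """
    lsb = a ^ b ^ c                      # sum bit: parity of the three inputs
    msb = (a & b) | (b & c) | (a & c)    # carry bit: majority of the three inputs
    return msb, lsb


# Example: three 4-bit vectors; every bit position where exactly two
# inputs are 1 yields count 2 (msb=1, lsb=0).
msb, lsb = popcnt3(0b1100, 0b1010, 0b0110)
print(bin(msb), bin(lsb))
```

Chaining such full-adder steps in a carry-save fashion is the standard way to reduce many bit vectors into a compact count, which is presumably how the $\texttt{POPCNT3}$-based method extends to larger input sizes.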
Submission Number: 42