A Highly-Scalable Deep-Learning Accelerator With a Cost-Effective Chip-to-Chip Adapter and a C2C-Communication-Aware Scheduler
Abstract: Multi-chip-module (MCM) technology heralds a new era for scalable DNN inference systems, offering a cost-effective alternative to large-scale monolithic designs by lowering fabrication and design costs. Nevertheless, MCMs often incur resource and performance overheads due to inter-chip communication, which largely reduce the performance gain of a scaled-out system. To address these challenges, this paper introduces a highly-scalable DNN accelerator with a lightweight chip-to-chip adapter (C2CA) and a C2C-communication-aware scheduler. Our design employs the C2CA for inter-chip communication, faithfully modeling an MCM system with constrained C2C bandwidth, e.g., about 1/16, 1/8, or 1/4 of the on-chip bandwidth. We empirically reveal that the limited C2C bandwidth largely affects the overall performance gain of an MCM system. For example, compared with a single-core engine, a four-chip MCM system with constrained C2C bandwidth achieves only $2.60\times$, $3.27\times$, $2.84\times$, and $2.74\times$ performance gains on ResNet50, DarkNet19, MobileNetV1, and EfficientNetS, respectively. To mitigate this problem, we propose a novel C2C-communication-aware scheduler with forward and backward inter-layer scheduling. Specifically, our scheduler effectively utilizes the C2C bandwidth while each core performs its own computation. To demonstrate the effectiveness and practicality of our concept, we modeled our design in Verilog HDL and implemented it on an FPGA board, i.e., a Xilinx ZCU104. The experimental results demonstrate that the system achieves significant throughput improvements over a single-chip configuration, with average gains of $1.87\times$ and $3.43\times$ for two-chip and four-chip configurations, respectively, on ResNet50, DarkNet19, MobileNetV1, and EfficientNetS.
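To illustrate why constrained C2C bandwidth caps MCM speedup and how overlapping inter-chip transfers with computation recovers it, here is a minimal analytical sketch in Python. It is not the paper's scheduler or performance model; the bandwidth figures, the workload, and the `layer_time`/`network_speedup` helpers are hypothetical assumptions for illustration only.

```python
# Illustrative sketch (not the paper's model): why constrained C2C bandwidth
# caps MCM speedup, and how overlapping C2C transfers with computation helps.
# All numbers and names below are hypothetical assumptions.

def layer_time(compute_cycles, c2c_bytes, c2c_bytes_per_cycle, overlap):
    """Per-layer latency on one chip of the MCM."""
    transfer_cycles = c2c_bytes / c2c_bytes_per_cycle
    if overlap:
        # C2C-communication-aware scheduling: inter-chip traffic for an
        # adjacent layer proceeds while the core computes, so only the
        # longer of the two terms is exposed on the critical path.
        return max(compute_cycles, transfer_cycles)
    # Naive scheduling: computation stalls until inter-chip traffic finishes.
    return compute_cycles + transfer_cycles

def network_speedup(layers, n_chips, c2c_bw_fraction, overlap):
    """Speedup of an n-chip MCM over a single chip for a list of layers,
    where each layer is (compute_cycles, bytes exchanged between chips)."""
    single_chip = sum(c for c, _ in layers)
    on_chip_bw = 64.0                      # bytes/cycle, assumed
    c2c_bw = on_chip_bw * c2c_bw_fraction  # e.g., 1/16, 1/8, or 1/4
    multi_chip = sum(layer_time(c / n_chips, b, c2c_bw, overlap)
                     for c, b in layers)
    return single_chip / multi_chip

# Hypothetical workload: (compute cycles, inter-chip bytes) per layer.
layers = [(100_000, 200_000), (80_000, 100_000), (120_000, 50_000)]
for frac in (1/16, 1/8, 1/4):
    naive = network_speedup(layers, 4, frac, overlap=False)
    aware = network_speedup(layers, 4, frac, overlap=True)
    print(f"C2C BW = 1/{int(1/frac)} of on-chip: "
          f"naive {naive:.2f}x, overlapped {aware:.2f}x")
```

Under this toy model, the serialized schedule falls well short of the ideal $4\times$ gain at low C2C bandwidth, while the overlapped schedule approaches it once transfers hide behind computation, which is the intuition behind the forward and backward inter-layer scheduling described above.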