Abstract: Multi-chip-module (MCM) technology offers a promising solution for designing large-scale deep-learning inference systems while minimizing fabrication and design costs. Nevertheless, compared to monolithic dies, MCMs often incur resource and performance overhead due to inter-chip communication. To address these challenges, this paper introduces a scalable MCM-based DNN accelerator that incorporates a lightweight chip-to-chip adapter (C2CA) and an effective multi-chip dataflow. Inspired by on-chip bus architectures, the C2CA efficiently shares data and address channels, achieving nearly optimal throughput with a significantly reduced pin count and hardware cost. Additionally, the proposed design adopts a layer-wise dataflow within a ring-based C2C topology to fully utilize the constrained C2C bandwidth, mitigating both performance and communication overhead. When implemented on the Xilinx ZCU104 FPGA board, the system demonstrates significant throughput improvements over a single-chip configuration, yielding 1.92x and 3.57x speedups for 2-chip and 4-chip configurations, respectively, on YOLOv3-Tiny.