Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective

Abstract: Deep Neural Network (DNN) INFerence-as-a-Service (INFaaS) is the dominant workload in current data centers, for which FPGAs are promising hardware platforms because of their high flexibility and energy efficiency. The dynamic, multi-tenant nature of INFaaS requires careful design in three aspects: multi-tenant architecture, multi-DNN scheduling, and multi-core mapping. These three factors are critical to system latency and energy efficiency, but they are challenging to optimize because they are tightly coupled. This paper proposes H3M, an automatic Design Space Exploration (DSE) framework that jointly optimizes the architecture, scheduling, and mapping for serving INFaaS on cloud FPGAs.
H3M explores: (1) the architecture design space with Heterogeneous spatial Multi-tenant sub-accelerators, (2) layer-wise scheduling for Heterogeneous Multi-DNN workloads, and (3) single-layer mapping to the Homogeneous Multi-core architecture. H3M beats state-of-the-art multi-tenant DNN accelerators, Planaria and Herald, by up to 7.5× and 3.6× in Energy-Delay-Product (EDP) reduction on the ASIC platform. On the Xilinx U200 and U280 FPGA platforms, H3M offers 2.1-5.7× and 1.8-9.0× EDP reduction over Herald.
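For readers unfamiliar with the metric, the sketch below shows how an EDP reduction factor is typically computed: EDP is the product of energy and delay (lower is better), and the reduction is the ratio of the baseline EDP to the proposed design's EDP. The function names and the numeric values are hypothetical placeholders chosen for illustration; they are not measurements from the paper or part of H3M.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-Delay Product: energy multiplied by delay (lower is better)."""
    return energy_joules * delay_seconds


def edp_reduction(baseline_edp: float, proposed_edp: float) -> float:
    """Factor by which the proposed design lowers EDP relative to a baseline."""
    return baseline_edp / proposed_edp


# Hypothetical placeholder values, NOT results reported in the paper.
baseline = edp(energy_joules=2.0, delay_seconds=0.5)    # 1.0 J*s
proposed = edp(energy_joules=1.0, delay_seconds=0.25)   # 0.25 J*s

print(edp_reduction(baseline, proposed))  # 4.0
```

Because EDP multiplies the two quantities, a design that halves both energy and delay achieves a 4× EDP reduction, which is why joint optimization of latency and energy can yield large factors.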