Abstract: FPGAs are gaining traction as a platform for accelerating applications that require both high performance and specialization. However, exploiting the full compute potential of FPGAs remains a critical and time-consuming task that usually requires expert knowledge. Typically, designers seek to maximize the use of hardened arithmetic blocks (DSPs, such as the DSP48 in Xilinx devices), but because their number is limited, the critical path quickly lengthens once portions of the design are mapped to lookup tables (LUTs). To mitigate the DSP limitation and maximize FPGA utilization, we propose combining FPGA overlay accelerators with a mapping method that efficiently exploits the FPGA's layout information and resources. The mapping method is a two-step process: (1) extraction of the FPGA's architectural and layout information, and (2) optimized placement of the accelerator's processing elements (PEs) onto the FPGA resources. The placement step maps the PEs to DSPs and LUTs so as to reduce the critical path among PEs. We applied our method to implement a systolic array, a multiplier array, and a coarse-grained reconfigurable architecture (CGRA) on a Xilinx FPGA. The proposed method achieves more than a 14× increase in performance and energy efficiency over the vendor tool mapping, while also increasing FPGA utilization by more than 1.5× compared to DSP-limited mappings.