Abstract: Deploying deep learning models on resource-constrained devices requires new optimisation techniques that make effective use of the limited computational and storage capacities of these devices. This research therefore introduces an efficient approach for fusing convolution (or fully connected), ReLU, and batch normalisation neural network layers into a unified, single-layer structure, together with a quantisation method for the fused layer. The approach has been evaluated on the Arduino BLE Sense (ARM Cortex-M4) and the Arduino Portenta H7 Lite (ARM Cortex-M4 and M7) processors, which are widely used in Internet of Things devices. Depending on the microcontroller unit and the compilation flags used, the fused layers reduce overall execution time by up to 1.53\(\times\), and on individual layers the speedup reaches 2.95\(\times\).
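For context, the standard arithmetic for folding batch normalisation parameters into the weights and bias of the preceding convolution (or fully connected) layer is sketched below in NumPy notation; the function name, shapes, and epsilon default are illustrative assumptions, not the paper's implementation, and ReLU is then applied inside the same fused kernel at inference time.

```python
import numpy as np

def fold_batchnorm_into_layer(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm parameters into the preceding layer.

    W: weight tensor with the output-channel axis first, e.g. (C_out, ...).
    b: bias vector of shape (C_out,).
    gamma, beta, mean, var: BatchNorm scale, shift, running mean, running variance,
    each of shape (C_out,).
    Returns fused weights and bias so that BN(W @ x + b) == W_fused @ x + b_fused.
    """
    scale = gamma / np.sqrt(var + eps)                       # per-channel scale factor
    W_fused = W * scale.reshape(-1, *([1] * (W.ndim - 1)))   # scale each output channel
    b_fused = (b - mean) * scale + beta                      # absorb mean/shift into bias
    return W_fused, b_fused
```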