Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers

Mark Deutel; Frank Hannig; Christopher Mutschler; Jürgen Teich

Published: 09 Oct 2024, Last Modified: 19 Nov 2024, Venue: Compression Workshop @ NeurIPS 2024, License: CC BY 4.0

Keywords: Edge AI, Microcontrollers, CNNs

Abstract: Convolutional Neural Networks (CNNs) have become the dominant approach to computer vision tasks. As a result, efficient CNN inference has become a major concern for processing image data close to where camera sensors generate it, most commonly on microcontroller units (MCUs). However, strict memory and bandwidth constraints are major obstacles to deploying CNNs on MCUs and make processing high-resolution images infeasible on many of them. In this work, we propose a method to fuse convolutional layers in quantized CNNs, which can serve as an additional dimension for optimizing the memory requirements of CNNs during inference. By fusing memory-intensive convolutions in the early inverted residual blocks of MobileNetV2-like CNNs, we show that memory requirements during inference can be reduced by up to 54% at the cost of only about a 14% increase in latency and no change in accuracy. As an example, we show that this reduction enables the deployment of image processing pipelines on a Cortex-M7 MCU that supports image resolutions up to $320\times320$ pixels, compared to the $128\times128$ pixel resolution commonly used in related work.
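To give a rough intuition for why fusing the layers of an inverted residual block can lower peak memory, the following sketch compares two simplified activation-memory estimates. It is not the authors' implementation or memory model. It assumes int8 activations (1 byte per element), an expand / 3x3 depthwise / project block, and a hypothetical row-wise fusion scheme that keeps only a few rows of the expanded tensor alive. The function names, the expansion factor of 6, and the example shapes are illustrative assumptions, and the resulting numbers are not comparable to the figures reported in the abstract.

```python
def layerwise_peak_bytes(h, w, c_in, c_out, expand=6, stride=1):
    """Peak activation bytes if every layer materializes its full output.

    Assumes int8 activations and that a layer's input and output buffers
    must be alive at the same time (illustrative model only).
    """
    c_mid = c_in * expand
    h_out, w_out = h // stride, w // stride
    sizes = [
        h * w * c_in,           # block input
        h * w * c_mid,          # after 1x1 expansion
        h_out * w_out * c_mid,  # after 3x3 depthwise convolution
        h_out * w_out * c_out,  # after 1x1 projection
    ]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


def fused_peak_bytes(h, w, c_in, c_out, expand=6, stride=1, kernel=3):
    """Peak activation bytes for a hypothetical row-wise fused block.

    Only `kernel` rows of the expanded tensor and one row of the depthwise
    output are buffered; block input and output are kept in full.
    """
    c_mid = c_in * expand
    h_out, w_out = h // stride, w // stride
    block_in = h * w * c_in
    block_out = h_out * w_out * c_out
    rolling_rows = kernel * w * c_mid  # sliding window over expanded rows
    depthwise_row = w_out * c_mid      # one output row of the depthwise conv
    return block_in + block_out + rolling_rows + depthwise_row


if __name__ == "__main__":
    # Hypothetical early block: 160x160 input, 16 -> 24 channels, stride 2.
    lw = layerwise_peak_bytes(160, 160, 16, 24, stride=2)
    fu = fused_peak_bytes(160, 160, 16, 24, stride=2)
    print(f"layer-by-layer: {lw} bytes, fused: {fu} bytes")
```

In this toy model the expanded intermediate tensor dominates the layer-by-layer estimate, which is why the early, high-resolution blocks are the natural candidates for fusion. The sketch ignores the recomputation or extra bookkeeping that fusion introduces, which is where a latency cost such as the one mentioned in the abstract would come from.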

Submission Number: 43
