BAP: BRANCH-AWARE PARALLEL EXECUTION FOR FASTER DNN INFERENCE ON MOBILE CPUS

27 Sept 2024 (modified: 10 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Neural Networks, Model Parallelism, Edge Devices, ASR Models, Transformers, Mobile CPUs
Abstract: The growing demand for real-time applications on edge devices underscores the need for faster inference of complex deep neural network (DNN) models. Although mobile devices increasingly incorporate specialized processors such as GPUs and TPUs, modern DNN models like Whisper and Vision Transformers often involve dynamic control flow and tensor operations that current frameworks do not support on these mobile accelerators. The CPU therefore remains the most viable option for improving inference latency on mobile devices, owing to its widespread availability, substantial memory caches, and support for all types of tensor operations. However, existing CPU optimization techniques focus on sequential execution, overlooking opportunities for parallelism within Automatic Speech Recognition (ASR) and transformer-based models and leaving performance on the table. This work introduces a runtime model analysis pipeline that extracts layer and branch structures from DNN model graphs to identify parallelizable branches. We propose BAP, a branch-aware memory allocation strategy that isolates memory arenas for parallel branches, reducing contention and enabling memory reuse within each branch. Additionally, we leverage CPU multithreading to execute these branches concurrently, tuning thread management and memory access to minimize overhead. Evaluated on ASR and transformer-based models, our approach reduces inference latency by up to 38.5%, decreases memory allocation requirements by up to 15.6x, and saves up to 20.2% of energy cost compared to TFLite's naive memory allocation.
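To make the abstract's strategy concrete, below is a minimal C++ sketch, not taken from the paper: the Arena, Branch, and RunBranchesInParallel names are hypothetical illustrations. It shows the two ideas the abstract combines: each parallelizable branch gets its own isolated memory arena, and branches with no data dependencies on one another run on separate CPU threads before rejoining.

    // Sketch of branch-aware parallel execution (hypothetical types, not
    // the authors' implementation): one memory arena per branch, one
    // thread per branch, join before the downstream merge op.
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    // Per-branch bump allocator: tensors inside one branch reuse this
    // arena, so concurrent branches never contend for a shared pool.
    class Arena {
     public:
      explicit Arena(std::size_t bytes) : buf_(bytes), offset_(0) {}
      void* Allocate(std::size_t bytes) {
        std::size_t aligned = (offset_ + 63) & ~std::size_t{63};  // 64-B align
        if (aligned + bytes > buf_.size()) return nullptr;        // arena full
        offset_ = aligned + bytes;
        return buf_.data() + aligned;
      }
      void Reset() { offset_ = 0; }  // reuse the arena on the next inference
     private:
      std::vector<std::uint8_t> buf_;
      std::size_t offset_;
    };

    // A branch: a chain of ops with no data dependency on sibling branches,
    // as identified by the graph-analysis pipeline the abstract describes.
    struct Branch {
      std::function<void(Arena&)> run;  // executes all ops in the branch
    };

    // Execute independent branches concurrently, one thread + arena each.
    void RunBranchesInParallel(std::vector<Branch>& branches,
                               std::size_t arena_bytes) {
      std::vector<Arena> arenas;
      arenas.reserve(branches.size());
      for (std::size_t i = 0; i < branches.size(); ++i)
        arenas.emplace_back(arena_bytes);

      std::vector<std::thread> workers;
      workers.reserve(branches.size());
      for (std::size_t i = 0; i < branches.size(); ++i)
        workers.emplace_back([&, i] { branches[i].run(arenas[i]); });
      for (auto& t : workers) t.join();  // rejoin before the merge op
    }

Keeping one arena per branch means allocations within a branch can reuse memory without locking, while threads never touch each other's pool; this is the contention-reduction and per-branch memory-reuse property the abstract claims.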
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9257