Efficient Deployment of Transformer Models on Edge TPU Accelerators: A Real System Evaluation

Published: 16 May 2023, Last Modified: 15 Jun 2023
ASSYST Oral
Keywords: Transformer Model, Edge AI Accelerators, Tensor Processing Unit (TPU), BERT
TL;DR: The paper proposes a methodology for deploying large Transformer models on edge AI accelerators by modifying their computational graphs and refactoring the operations that edge devices do not support.
Abstract: Transformer models have become a dominant architecture in machine learning. From natural language processing to more recent computer vision applications, Transformers have shown remarkable results and established a new state of the art in many domains. However, this increase in performance has come at the cost of ever-growing model sizes that require more resources to deploy. Machine learning (ML) models are used in many real-world systems, such as robotics, mobile devices, and Internet of Things (IoT) devices, that require fast inference with low energy consumption. For battery-powered devices, lower energy consumption translates directly into longer battery life. To address these issues, several edge AI accelerators have been developed. Among these, the Coral Edge TPU has shown promising results for image classification while maintaining very low energy consumption. However, many of these devices, including the Coral Edge TPU, were originally designed to accelerate convolutional neural networks, which makes deploying Transformers challenging. Here, we propose a methodology for deploying Transformers on the Edge TPU. We provide extensive latency, power, and energy comparisons among leading edge AI devices and show that our methodology enables real-time inference of large Transformers while achieving the lowest power and energy consumption of the leading edge AI devices on the market.
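The paper details the graph modifications themselves; as a rough illustration of the standard deployment path such a methodology builds on, the sketch below converts a saved Transformer to a fully int8-quantized TFLite model, which the `edgetpu_compiler` can then map onto the Edge TPU. The model path, input names, vocabulary size, and sequence length are assumptions for illustration, and any op the converter cannot express as an int8 builtin is exactly the kind of operation the paper's refactoring targets.

```python
import numpy as np
import tensorflow as tf

# Hypothetical paths and shapes for illustration; the paper's actual
# graph rewrites are not reproduced here.
SAVED_MODEL_DIR = "bert_savedmodel"  # assumed SavedModel export of the Transformer
SEQ_LEN = 128                        # assumed fixed sequence length
VOCAB_SIZE = 30522                   # assumed BERT-base vocabulary size

def representative_dataset():
    # Full int8 quantization (required by the Edge TPU compiler) needs
    # sample inputs to calibrate activation ranges. Random token IDs are
    # used here purely as a stand-in for real calibration data.
    for _ in range(100):
        yield [
            np.random.randint(0, VOCAB_SIZE, size=(1, SEQ_LEN), dtype=np.int32),  # input_ids
            np.ones((1, SEQ_LEN), dtype=np.int32),                                # attention_mask
        ]

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the graph to int8 builtin ops so every op can run on the
# Edge TPU; unsupported ops would otherwise fall back to the host CPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("bert_int8.tflite", "wb") as f:
    f.write(converter.convert())
# Then compile for the accelerator: edgetpu_compiler bert_int8.tflite
```

If the converter rejects an op under the int8-only constraint, that op must be rewritten or replaced in the computational graph before conversion, which is the class of refactoring the paper's methodology addresses.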
Workshop Track: ASSYST
Presentation: In-Person
Presenter Full Name: Mohammadreza Mohammadi
Presenter Email: mohammm@email.sc.edu
Presenter Bio: Mohammadreza Mohammadi is a second-year Ph.D. student in the Intelligent Circuits, Architectures, and Systems (iCAS) Lab at the University of South Carolina. His research interests include edge computing, neuromorphic computing, and AI hardware acceleration.