Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

Published: 01 Jan 2024, Last Modified: 16 May 2025 · SOCC 2024 · CC BY-SA 4.0
Abstract: Transformer-based neural networks (NNs) prevail in today’s artificial intelligence applications, including autonomous driving, natural language processing, and generative modeling, showing superior accuracy and generalization over traditional deep-learning models. However, the quadratically scaling computation and complex dataflow of self-attention pose challenges for the efficient deployment of Transformer-based NNs on edge and edge-server devices, where the latency of single-batch inference is a critical concern. The lack of data parallelism necessitates exploring additional dimensions of tensor parallelism, more specifically sequence parallelism, in Transformer inference for strong scaling in domain-specific accelerator (DSA) design, which is non-trivial due to the temporal dependency of the max-finding in softmax operators. This work formulates these challenges as an on-chip buffering problem, and then puts forward a hardware-software co-design approach exploiting a max-finding-free approximation for softmax operators, which removes the blocking of the inference pipeline and thus alleviates the on-chip buffering pressure. An example architecture design shows up to $2.83 \times$ and $28.02 \times$ speedups over the baseline DSA designs, respectively, with negligible algorithmic performance loss.
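As a minimal illustration of the dependency the abstract refers to (the paper's specific approximation is not reproduced here), the sketch below contrasts the standard numerically stable softmax, whose max-finding pass must complete before any exponentiation can start, with a hypothetical max-finding-free variant that exponentiates streamed scores directly under a fixed clipping range so the normalization sum can be accumulated on the fly. Function names and the `clip` parameter are assumptions for illustration only.

```python
import numpy as np

def softmax_with_max(scores):
    """Standard numerically stable softmax: the row-max pass is a
    temporal dependency that blocks a streaming inference pipeline."""
    m = scores.max(axis=-1, keepdims=True)      # blocking max-finding pass
    e = np.exp(scores - m)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_max_free(scores, clip=30.0):
    """Hypothetical max-finding-free variant: exponentiate scores as
    they arrive, using a fixed clip for range safety, so the running
    sum can be accumulated without waiting for the row max.
    Illustrative stand-in, not the paper's approximation."""
    e = np.exp(np.clip(scores, -clip, clip))
    return e / e.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    x = np.random.randn(4, 16)  # e.g., one attention-score row per query
    print(np.max(np.abs(softmax_with_max(x) - softmax_max_free(x))))
```

When the scores stay within the clipped range, the two functions agree numerically; the practical difference is that the max-free form admits a fully pipelined, single-pass dataflow, which is the property the co-design exploits.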