Cube-fx: Mapping Taylor Expansion Onto Matrix Multiplier-Accumulators of Huawei Ascend AI Processors

Yifeng Tang, Huaman Zhou, Zhuoran Ji, Cho-Li Wang

Published: 2025, Last Modified: 28 Jan 2026IEEE Trans. Parallel Distributed Syst. 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Taylor expansion, a mature method for function evaluations used in Artificial Intelligence (AI) applications, approximates functions with polynomials. In addition to the function evaluations, AI applications require massive matrix multiplications, inspiring manufacturers to propose AI processors with matrix multiplier-accumulators (MACs). However, compared with the powerful Matrix MACs, the vectorized units of the AI processors cannot efficiently carry the existing Taylor expansion implementation of Single Instruction Multiple Data (SIMD) parallelism. Leveraging the Matrix MACs for Taylor expansion becomes an ideal direction. In previous studies, migrating optimized algorithms to the Matrix MACs requires matrix generation during the runtime. The generation is expensive and even cancels the accelerations brought by the Matrix MACs on the AI processors, which Taylor expansion also suffers. This article presents Cube-fx, a mapping algorithm of Taylor expansion for multiple functions onto Matrix MACs. Cube-fx expresses the building and computation in matrix multiplications without inefficient dynamic matrix generation. On Huawei Ascend processors, Cube-fx averagely achieves 1.64× speedups compared with vectorized Horner’s Method with 56.38$\%$ vectorized operations reduced.

External IDs:dblp:journals/tpds/TangZJW25