Optimizing Cache Accesses with Tensor Memory Format Search for Transformers in TVM

Xianghuan He, Xitong Gao, Juanjuan Zhao, Chengxi Gao, Kejiang Ye

Published: 2022, Last Modified: 27 Nov 2025, CLOUD 2022, CC BY-SA 4.0

Abstract: Transformer-based models have achieved great success in natural language processing and computer vision applications. These models, however, often comprise a large number of parameters and tend to be computationally intensive, which makes them challenging to deploy on resource-constrained devices. Compiling these models with deep learning compilers, e.g. TVM, can reap the performance benefits of tailoring CUDA kernels specifically to the target GPU devices. In this paper, we focus on complementing existing compiler optimization passes in TVM by further exploring the impact of the tensor memory formats used by intermediate activations on cache accesses, and the resulting performance implications. First, building on top of the graph-based abstraction, we express each layer node, e.g. multi-layer perceptron (MLP) and self-attention layers, using Einstein summation (Einsum) notation. Intermediate tensors thus form the edges that connect layer nodes as their inputs and outputs. As intermediate tensors are typically stored contiguously in memory, their memory formats, i.e. the ordering of their dimensions, can have a notable effect on cache access behavior, since strided memory accesses are typically slower than contiguous ones. Yet existing compiler frameworks focus on layer-wise optimizations and often neglect the impact of the memory formats of a layer's input and output tensors on the performance of the resulting kernels. To this end, this paper proposes to optimize the performance of compiled models by searching for optimal memory formats for all intermediate tensors. We use the MLP-Mixer model architecture as a case study of the optimization process and deploy the resulting optimized models with TVM on target GPUs. As exhaustive search incurs a substantial computational cost, we propose algorithms to efficiently navigate the search space of memory formats of intermediate tensors. Applying the algorithm to an MLP-Mixer model with 42 mixer-layers, we achieve a 23.7% improvement in inference performance.
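To make the notion of a memory format concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a hypothetical activation tensor with dimensions b (batch), t (tokens), and c (channels), and two illustrative Einsum layers: token mixing "btc,ts->bsc", which reduces over t, and channel mixing "btc,cd->btd", which reduces over c. The tensor sizes, the helper names, and the toy cost (the element stride of the reduced dimension, where 1 means contiguous) are all assumptions chosen for illustration; a real search would measure compiled kernels instead.

```python
from itertools import permutations

# Hypothetical activation shape: batch, tokens, channels (illustrative sizes).
shape = {"b": 8, "t": 196, "c": 512}


def strides_for_layout(shape, layout):
    """Return the element stride of each logical dimension when the tensor
    is stored contiguously with dimensions ordered as `layout`
    (outermost first, innermost last)."""
    strides = {}
    step = 1
    for dim in reversed(layout):
        strides[dim] = step
        step *= shape[dim]
    return strides


def toy_cost(layout, reduced_dim):
    """Toy proxy for access cost (an assumption, not a measured quantity):
    the stride of the dimension the Einsum reduces over."""
    return strides_for_layout(shape, layout)[reduced_dim]


def best_layout(reduced_dim):
    """Exhaustively try every dimension ordering of one tensor and keep
    the first one with the lowest toy cost."""
    return min(permutations(shape), key=lambda layout: toy_cost(layout, reduced_dim))


# Token mixing ("btc,ts->bsc") reduces over t; channel mixing
# ("btc,cd->btd") reduces over c. They prefer different layouts.
print(strides_for_layout(shape, ("b", "t", "c")))
print(best_layout("t"))
print(best_layout("c"))
```

Under this toy cost the two layers prefer different orderings of the same tensor, which is the tension a per-tensor format search has to resolve. The sketch enumerates 3! = 6 orderings for a single tensor; enumerating every combination across all intermediate tensors of a deep model multiplies these counts, which is why exhaustive search becomes costly.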

External IDs: dblp:conf/cloud2/HeGZGY22