High Performance Hierarchical Tucker Tensor Learning Using GPU Tensor Cores

Published: 01 Jan 2023, Last Modified: 28 Jan 2025 · IEEE Trans. Computers 2023 · CC BY-SA 4.0
Abstract: Extracting information from large-scale high-dimensional data is a fundamentally important task in high performance computing, where the hierarchical Tucker (HT) tensor learning approach (learning a tensor-tree structure) has been widely used in many applications. However, HT tensor learning algorithms are compute-intensive due to the “curse of dimensionality,” i.e., the time complexity grows exponentially with the order of the data tensor. The computation of HT tensor learning algorithms boils down to tensor primitives, which are amenable to computing on GPU tensor cores. Existing work does not support HT tensor learning using GPU tensor cores. There are three main challenges to address: 1) to accelerate tensor learning primitives using GPU tensor cores; 2) to implement the tensor learning algorithms using GPU tensor cores and multiple GPUs; 3) to support large-scale data tensors exceeding the GPU memory capacity. In this paper, we present efficient HT tensor learning primitives using GPU tensor cores and demonstrate three applications. First, we utilize GPU tensor cores to optimize HT tensor learning primitives, including tensor contractions, tensor matricizations, and tensor singular value decomposition (SVD). We employ the optimized primitives to optimize HT tensor decomposition algorithms for big data analysis. Second, we propose a novel HT tensor layer for deep neural networks, whose training process involves only a forward pass without backpropagation. The forward pass consists of tensor operations, thus further exploiting the computing power of GPU tensor cores. Third, we apply the optimized primitives to develop a tensor-tree structured quantum machine learning algorithm, the tree tensor network (TTN).
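The abstract names tensor matricization and truncated SVD as the primitives underlying HT decomposition. The following is a minimal CPU sketch (not the paper's tensor-core implementation) of these two building blocks, computing a truncated leaf factor per mode of a 4th-order tensor; all function names and the choice of rank are illustrative assumptions.

```python
# Hedged sketch of two HT tensor learning primitives: mode-n matricization
# and a truncated SVD that yields a low-rank factor for one tree node.
# NumPy stands in for the GPU tensor-core kernels described in the paper.
import numpy as np

def matricize(tensor, mode):
    """Unfold `tensor` along `mode` into a matrix (mode-n matricization)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def truncated_svd_factor(tensor, mode, rank):
    """Leading `rank` left singular vectors of the mode-n unfolding."""
    unfolding = matricize(tensor, mode)
    u, _, _ = np.linalg.svd(unfolding, full_matrices=False)
    return u[:, :rank]

# Example: each leaf of an HT tree for a 4th-order tensor gets one factor.
x = np.random.default_rng(0).normal(size=(4, 5, 6, 7))
leaf_factors = [truncated_svd_factor(x, m, rank=2) for m in range(4)]
print([f.shape for f in leaf_factors])  # → [(4, 2), (5, 2), (6, 2), (7, 2)]
```

In a full HT decomposition, the interior nodes of the tree are obtained by contracting these leaf factors against the tensor and recursing, which is where the tensor-contraction primitive the abstract mentions comes in.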
Compared with TensorLy and TensorNetwork on NVIDIA A100 GPUs, our third-order HT tensor decomposition algorithm achieves up to $8.92 \times$ and $6.42 \times$ speedups, respectively, and our high-order case achieves up to $32.67 \times$ and $23.97 \times$ speedups, respectively. Our HT tensor layer for a fully connected neural network achieves $49.2 \times$ compression at the cost of a 0.5% drop in accuracy and a $1.42 \times$ speedup compared with the implementation on CUDA cores; for AlexNet, our HT tensor layer achieves $9.45 \times$ compression at the cost of a 0.8% drop in accuracy and a $1.87 \times$ speedup compared with the implementation on CUDA cores. Our TTN algorithm achieves up to $11.17\times$ speedup compared with TensorNetwork, indicating the potential of optimized tensor learning primitives for the classical simulation of quantum machine learning algorithms.