Keywords: inference speedup, bilinear operators, non-convex optimization
TL;DR: New GPU-native bilinear operator to replace linear layers in NNs, which is both cheaper and more expressive.
Abstract: Modern AI relies on huge matrix multiplications (MatMuls), whose computational cost poses a scalability problem for inference and training. We propose an alternative, *GPU-native* bilinear operator to MatMuls in neural networks, which offers a three-way tradeoff between speed, accuracy, and parameter count. In particular, this operator requires substantially *fewer* FLOPs to evaluate ($\ll n^3$), yet *increases* the parameter count compared to MatMul ($\gg n^2$). We call this operator *Strassen-Tile* (STL).
The key idea behind STL is a local **learnable** change-of-basis, applied to weight and activation *tiles*, followed by an *element-wise* product between the tiles, implemented simultaneously via MatMul. The key question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations of STL (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random initialization followed by SGD, a phenomenon we attribute to the increased parameter count of STL. This motivates further algorithmic study of STL optimization in DNNs.
Our experiments demonstrate that STL can approximate 4x4 MatMul while reducing FLOPs by a factor of $2.66$, and can **improve** ImageNet-1K accuracy of the SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA-optimized `PyTorch` code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounding, suggest STL as a promising building block for scalable and cost-efficient AI.
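To make the mechanism above concrete, here is a minimal, hypothetical `PyTorch` sketch of a Strassen-Tile-style tiled product (not the submission's implementation): the tensor layout, parameter names (`enc_x`, `enc_w`, `dec`), and the choice of tile size `t` and rank `r` are illustrative assumptions. With $t=2$, $r=7$ and the classical Strassen encoders/decoder, this bilinear form recovers exact $2\times 2$ block multiplication; learning the change-of-basis matrices instead yields the STL tradeoff described above.

```python
import torch

def stl_matmul(X, W, enc_x, enc_w, dec, t):
    """Illustrative Strassen-Tile-style product of X (m x n) and W (n x p).

    enc_x, enc_w: (r, t*t) learnable change-of-basis applied to flattened tiles.
    dec:          (t*t, r) learnable decoder mapping element-wise products back to tiles.
    All names and shapes are assumptions for illustration, not the authors' API.
    """
    m, n = X.shape
    n2, p = W.shape
    assert n == n2 and m % t == 0 and n % t == 0 and p % t == 0
    mt, nt, pt = m // t, n // t, p // t

    # Split both operands into flattened t x t tiles: (row-tile, col-tile, t*t).
    Xt = X.reshape(mt, t, nt, t).permute(0, 2, 1, 3).reshape(mt, nt, t * t)
    Wt = W.reshape(nt, t, pt, t).permute(0, 2, 1, 3).reshape(nt, pt, t * t)

    # Local change-of-basis on every tile (a single batched MatMul per side).
    Xe = Xt @ enc_x.T                           # (mt, nt, r)
    We = Wt @ enc_w.T                           # (nt, pt, r)

    # Element-wise product in the transformed basis, summed over the inner tile index.
    Pe = torch.einsum('ikr,kjr->ijr', Xe, We)   # (mt, pt, r)

    # Decode each product back into a t x t output tile and reassemble the matrix.
    Ct = Pe @ dec.T                             # (mt, pt, t*t)
    return Ct.reshape(mt, pt, t, t).permute(0, 2, 1, 3).reshape(m, p)
```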
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 16933