Auto-tuning Matrix Multiplication and Convolution for Deep Learning on CPUs

21 May 2021 (modified: 05 May 2023) · NeurIPS 2021 Submitted · Readers: Everyone
Keywords: deep learning compiler, ideal cache model, loop tiling, auto-tuning, matrix multiplication, convolution
Abstract: Deep learning (DL) compilers have emerged to narrow the gap between abundant, fast-growing DL models and the lag in high-performance implementations of these models on diverse hardware devices. In this work, we introduce several optimization strategies that combine analytic ideal cache models with machine learning models trained on real hardware measurements, and integrate them into a unified auto-tuning framework, called AutoMCL, to improve the performance of DL compilers at both the operator level and in end-to-end model inference. We evaluate AutoMCL against the state of the art on multiple CPUs. End-to-end evaluations show that AutoMCL outperforms TensorFlow on fully connected and convolutional neural networks with geometric-mean speedups of $9.29\times$ and $1.54\times$, respectively. Over the AutoTVM baseline, AutoMCL achieves average speedups of $1.37\times$ in inference and $2.16\times$ in optimization time for fully connected neural networks, and gains a $2.55\%$ inference performance improvement for convolutional neural networks at $1.91\%$ more optimization cost.
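To make the tuned optimization concrete, below is a minimal sketch of loop tiling (cache blocking) for matrix multiplication, the kind of transformation an auto-tuner like AutoMCL searches over. This is not the paper's implementation: the `tile` parameter is a hypothetical stand-in for the tile sizes that, per the abstract, would be chosen by combining an ideal cache model with measured hardware feedback.

```python
# Hypothetical illustration of loop tiling for C = A @ B.
# Blocking the i/j/k loops lets each tile of A, B, and C stay in cache
# and be reused before eviction, which is what an ideal cache model
# tries to predict analytically.
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Tiled matrix multiplication; `tile` would be auto-tuned in practice."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply one pair of tiles; NumPy handles the inner loops.
                # Slices past the array edge are safely clipped by NumPy.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile]
                    @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C
```

An auto-tuner would treat `tile` (typically separate tile sizes per loop) as a search-space knob, scoring candidates with the cache model and refining the ranking with real hardware measurements.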
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Supplementary Material: pdf