Approximation to Smooth Functions by Low-Rank Swish Networks

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We offer a theoretical basis for low-rank compression from the perspective of universal approximation theory by proving any Hölder function can be approximated by a Swish network with low-rank weight matrices.
Abstract: While deep learning has achieved remarkable success in a wide range of applications, its substantial computational cost limits the scenarios in which neural networks can be deployed. To alleviate this problem, low-rank compression has been proposed as an efficient and hardware-friendly class of network compression methods, which reduces computation by replacing large weight matrices in neural networks with products of two small ones. In this paper, we implement low-rank networks by inserting a sufficiently narrow linear layer without bias between each pair of adjacent nonlinear layers. We prove that low-rank Swish networks of a fixed depth are capable of approximating any function from the Hölder ball $\mathcal{C}^{\beta, R}([0,1]^d)$ within an arbitrarily small error, where $\beta$ is the smoothness parameter and $R$ is the radius. Our constructive approximation ensures that the width of the linear hidden layers required for approximation is at most one-third of the width of the nonlinear layers, which implies that the computational cost can be reduced by at least one-third compared with a network of the same depth and nonlinear-layer width but without the narrow linear hidden layers. Our theoretical finding offers a basis for low-rank compression from the perspective of universal approximation theory.
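The following is a minimal sketch, not the paper's exact construction, of the architecture the abstract describes: a Swish (SiLU) network in which every hidden-to-hidden map is factorized by a narrow, bias-free linear layer, so each large weight matrix becomes a product of two small ones. The helper `low_rank_swish_net` and all widths, ranks, and depths below are illustrative assumptions, not values taken from the paper.

```python
# Sketch of a low-rank Swish network: a narrow, bias-free linear layer is
# inserted between each pair of adjacent nonlinear (Swish) layers, capping the
# rank of every hidden-to-hidden map at `rank`.
import torch
import torch.nn as nn


def low_rank_swish_net(d_in, width, rank, depth, d_out):
    """Build a Swish network whose hidden-to-hidden weights have rank <= `rank`.

    Choosing rank <= width // 3 mirrors the paper's claim that the linear
    layers need be no wider than one-third of the nonlinear layers.
    """
    layers = [nn.Linear(d_in, width), nn.SiLU()]  # SiLU is Swish with beta = 1
    for _ in range(depth - 1):
        layers += [
            nn.Linear(width, rank, bias=False),  # narrow linear layer, no bias
            nn.Linear(rank, width),              # expand back to the full width
            nn.SiLU(),
        ]
    layers.append(nn.Linear(width, d_out))
    return nn.Sequential(*layers)


# Hypothetical usage: approximate a smooth target on the unit cube [0,1]^4.
net = low_rank_swish_net(d_in=4, width=96, rank=32, depth=5, d_out=1)
x = torch.rand(8, 4)     # inputs in [0,1]^4
print(net(x).shape)      # torch.Size([8, 1])
```

With rank = width / 3, each factorized block costs about 2 · width · rank = (2/3) · width² multiply-adds instead of width², which is where the "at least one-third" reduction in the abstract comes from.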
Lay Summary: Deep learning drives breakthroughs in AI, but its massive computational demands hinder real-world use. Shrinking networks via low-rank compression shows promise, yet it lacks theoretical guarantees, which limits its reliability. We offer a theoretical basis for low-rank compression from the perspective of universal approximation theory by proving that any function from a broad class can be approximated by a Swish network with low-rank weight matrices. Our findings partially guarantee that low-rank compression is a viable approach to network compression, as it generally maintains performance while reducing model size.
Primary Area: Theory->Learning Theory
Keywords: Neural Network, Swish Activation Function, Universal Approximation Theory, Network Compression, Low-Rank Compression, Low-Rank Factorization
Submission Number: 627