Minimum Width for Universal Approximation using Squashable Activation Functions

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY-NC-ND 4.0
Abstract: The exact minimum width that allows for universal approximation by unbounded-depth networks is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable activation functions, which can approximate the identity function and the binary step function when alternately composed with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum-width result holds for these general classes of activation functions.
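As an illustration of the width bound only (not the paper's construction or proof), the sketch below builds a deep, narrow feedforward network whose hidden width equals $\max\{d_x, d_y, 2\}$, using the sigmoid, a non-affine analytic activation and hence squashable by the paper's result. The depth and the random weights are hypothetical placeholders; the paper establishes the existence of suitable weights at this width, not these particular values.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: a non-affine analytic activation."""
    return 1.0 / (1.0 + np.exp(-z))

def narrow_mlp_forward(x, d_y, depth=8, rng=None):
    """Forward pass of a deep, narrow MLP from R^{d_x} to R^{d_y}.

    The hidden width is max(d_x, d_y, 2), matching the minimum width
    for L^p universal approximation stated in the abstract. Weights
    are random placeholders for illustration only.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d_x = x.shape[-1]
    width = max(d_x, d_y, 2)

    # Input layer: map from d_x inputs to the narrow hidden width.
    h = sigmoid(x @ rng.standard_normal((d_x, width)))
    # Hidden layers: width stays fixed at max(d_x, d_y, 2).
    for _ in range(depth - 2):
        h = sigmoid(h @ rng.standard_normal((width, width)))
    # Output layer: affine map to d_y outputs, no activation.
    return h @ rng.standard_normal((width, d_y))

# Example: d_x = 3 inputs, d_y = 2 outputs -> hidden width max(3, 2, 2) = 3.
x = np.array([[0.2, 0.5, 0.9]])
print(narrow_mlp_forward(x, d_y=2).shape)  # (1, 2)
```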
Lay Summary: Neural networks are the engines behind many AI tools we use today, like voice assistants and image recognition. In this paper, we investigate how narrow a neural network can be while still performing a target task. We show that even very narrow networks (just a few computational units wide) can handle any task, as long as they are deep enough. Specifically, the network only needs to be as wide as the number of inputs or outputs, whichever is bigger. In the simplest case, just two units can be enough. In summary, our results show that AI systems do not always need to be wide to be powerful.
Primary Area: Deep Learning->Theory
Keywords: universal approximation, minimum width
Submission Number: 3462