Neural network approximation: Three hidden layers are enough

Zuowei Shen, Haizhao Yang, Shijun Zhang

Published: 2021, Last Modified: 28 Sept 2024. Neural Networks 2021. CC BY-SA 4.0

Abstract: A three-hidden-layer neural network with super approximation power is introduced. This network is built with the floor function ($\lfloor x\rfloor$), the exponential function ($2^{x}$), the step function ($\mathbb{1}_{x\ge 0}$), or their compositions as the activation function in each neuron; hence we call such networks Floor-Exponential-Step (FLES) networks. For any width hyper-parameter $N\in\mathbb{N}^{+}$, it is shown that FLES networks with width $\max\{d,N\}$ and three hidden layers can uniformly approximate a Hölder continuous function $f$ on $[0,1]^{d}$ with an exponential approximation rate $3\lambda(2\sqrt{d})^{\alpha}2^{-\alpha N}$, where $\alpha\in(0,1]$ and $\lambda>0$ are the Hölder order and constant, respectively. More generally, for an arbitrary continuous function $f$ on $[0,1]^{d}$ with a modulus of continuity $\omega_f(\cdot)$, the constructive approximation rate is $2\omega_f(2\sqrt{d})\,2^{-N}+\omega_f(2\sqrt{d}\,2^{-N})$. Moreover, we extend such a result to general bounded continuous functions on a bounded set $E\subseteq\mathbb{R}^{d}$. As a consequence, this new class of networks overcomes the curse of dimensionality in approximation power when the variation of $\omega_f(r)$ as $r\to 0$ is moderate (e.g., $\omega_f(r)\lesssim r^{\alpha}$ for Hölder continuous functions), since the dominant term in our approximation rate is essentially $\sqrt{d}$ times a function of $N$ independent of $d$ within the modulus of continuity.
Finally, we extend our analysis to derive similar approximation results in the $L^{p}$-norm for $p\in[1,\infty)$ via replacing Floor-Exponential-Step activation functions by continuous activation functions.
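
To make the two rates in the abstract concrete, the following is a minimal sketch that evaluates them numerically. It only evaluates the stated error bounds; it does not construct a FLES network. The function names (`holder_rate`, `modulus_rate`, `holder_modulus`) and the sample values of `d`, `N`, `alpha`, and `lam` are illustrative assumptions, not taken from the paper. Plugging the Hölder modulus $\omega_f(r)=\lambda r^{\alpha}$ into the general rate should never exceed the Hölder rate, because $2^{-N}\le 2^{-\alpha N}$ when $\alpha\le 1$.

```python
import math


def holder_rate(d, N, alpha, lam):
    """Hoelder bound from the abstract: 3*lam*(2*sqrt(d))**alpha * 2**(-alpha*N)."""
    return 3.0 * lam * (2.0 * math.sqrt(d)) ** alpha * 2.0 ** (-alpha * N)


def modulus_rate(omega, d, N):
    """General bound: 2*omega(2*sqrt(d))*2**(-N) + omega(2*sqrt(d)*2**(-N))."""
    r = 2.0 * math.sqrt(d)
    return 2.0 * omega(r) * 2.0 ** (-N) + omega(r * 2.0 ** (-N))


def holder_modulus(alpha, lam):
    """Hypothetical helper: the modulus of continuity omega(r) = lam * r**alpha."""
    return lambda r: lam * r ** alpha


if __name__ == "__main__":
    alpha, lam = 0.5, 1.0  # assumed sample values, for illustration only
    for d in (1, 10, 100, 1000):
        for N in (10, 20, 40):
            general = modulus_rate(holder_modulus(alpha, lam), d, N)
            holder = holder_rate(d, N, alpha, lam)
            print(f"d={d:5d}  N={N:3d}  general={general:.3e}  holder={holder:.3e}")
```

In this sketch the dimension enters only through the factor $(2\sqrt{d})^{\alpha}$, while the decay in $N$ is exponential, which is the sense in which the abstract describes the dependence on $d$ as moderate.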