Abstract: Most deep neural networks use non-periodic, monotonic (or at least quasiconvex) activation functions. While sinusoidal activation functions have been used successfully for specific applications, they remain largely ignored and are regarded as difficult to train. In this paper we formally characterize why these networks can indeed often be difficult to train, even in very simple scenarios, and describe how the presence of infinitely many shallow local minima emerges from the architecture. We also provide an explanation for the good performance achieved on a typical classification task, by showing that, for several network architectures, the periodic cycles are largely ignored when learning is successful. Finally, we show that there are non-trivial tasks, such as learning algorithms, where networks with sinusoidal activations can learn faster than networks with more established monotonic activations.
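A minimal sketch (not from the paper) of the kind of network discussed above: a small fully connected model in which a sine activation replaces the usual monotonic nonlinearity. The layer sizes, toy regression target, and training loop are illustrative assumptions.

```python
# Illustrative sketch: a small fully connected network with a sine activation.
# Architecture, data, and hyperparameters are arbitrary choices, not the paper's setup.
import torch
import torch.nn as nn


class SineMLP(nn.Module):
    def __init__(self, in_dim=1, hidden=32, out_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        # sine in place of a monotonic activation such as tanh or ReLU
        return self.fc2(torch.sin(self.fc1(x)))


# Toy periodic regression target, where the periodic part of sine can be useful.
x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
y = torch.sin(3.0 * x)

model = SineMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```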
TL;DR: Why networks with sine as the activation function are difficult to train in theory. In practice they often do not use the periodic part when it is not needed, but when it is beneficial they can learn faster.
Conflicts: tut.fi
Keywords: Theory, Deep learning, Optimization, Supervised Learning