TL;DR: We analyze the loss landscape of attention-based neural networks and explain why attention helps these networks train to good performance.
Abstract: Attention mechanisms have advanced the state of the art in several machine learning tasks. Despite significant empirical gains, theoretical analyses of their effectiveness are lacking. In this paper, we address this gap by studying the landscape of the population and empirical loss functions of attention-based neural networks. Our results show that, under mild assumptions, every local minimum of a two-layer global attention model has low prediction error, and that attention models require lower sample complexity than models that do not employ attention. We then extend our analysis to the popular self-attention model, proving that it delivers consistent predictions over a more expressive class of functions. Additionally, our theoretical results provide several guidelines for designing attention mechanisms. Our findings are validated with experiments on the MNIST and IMDB reviews datasets.
Keywords: Attention, deep learning, sample complexity, self-attention
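For reference, the abstract does not spell out the architectures analyzed, so the following is only a minimal sketch of standard scaled dot-product self-attention in PyTorch, not necessarily the exact two-layer global attention or self-attention model studied in the paper:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (..., n, n) pairwise similarities
    weights = F.softmax(scores, dim=-1)          # attention weights over the keys
    return weights @ V                           # weighted sum of value vectors

# Toy usage: a batch with one sequence of 5 tokens, 8-dimensional embeddings.
x = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([1, 5, 8])
```

In full models, Q, K, and V are typically learned linear projections of the input rather than the raw embeddings used in this toy example.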