Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

25 Sept 2019 (modified: 03 Apr 2024) | ICLR 2020 Conference Blind Submission | Readers: Everyone
Keywords: Transformer, Ordinary Differential Equation, Multi-Particle Dynamic System, Natural Language Processing
Abstract: The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective on understanding the architecture: we show that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system. In particular, the way words in a sentence are abstracted into contexts by passing through the layers of the Transformer can be interpreted as approximating the movement of multiple particles in space using the Lie-Trotter splitting scheme and Euler's method. Given this ODE perspective, the rich literature of numerical analysis can guide the design of effective structures beyond the Transformer. As an example, we propose to replace the Lie-Trotter splitting scheme with the Strang-Marchuk splitting scheme, which is more commonly used and has a much lower local truncation error. The Strang-Marchuk splitting scheme suggests that the self-attention and position-wise feed-forward network (FFN) sub-layers should not be treated equally. Instead, each layer should use two position-wise FFN sub-layers with the self-attention sub-layer placed in between. This leads to a brand new architecture. Such an FFN-attention-FFN layer is "Macaron-like", and thus we call the network with this new architecture the Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks. Reproducible code can be found at http://anonymized
Code: [zhuohan123/macaron-net](https://github.com/zhuohan123/macaron-net) + [1 community implementation](https://paperswithcode.com/paper/?openreview=SJl1o2NFwS)
Data: [CoLA](https://paperswithcode.com/dataset/cola), [GLUE](https://paperswithcode.com/dataset/glue), [MRPC](https://paperswithcode.com/dataset/mrpc), [QNLI](https://paperswithcode.com/dataset/qnli), [SST](https://paperswithcode.com/dataset/sst), [SST-2](https://paperswithcode.com/dataset/sst-2)
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:1906.02762/code)
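
The abstract contrasts the Lie-Trotter and Strang-Marchuk splitting schemes. The following is a minimal sketch of that contrast on a toy scalar ODE, not the paper's model or code. The ODE dx/dt = -x + x^2, the function names, the initial value, and the step sizes are all assumptions chosen for illustration, because both sub-flows and the full solution have closed forms. The term `-x` stands in for one sub-layer and `x**2` for the other. The Strang-Marchuk step applies a half step of one operator, a full step of the other, then another half step, which mirrors the FFN-attention-FFN ordering described in the abstract.

```python
import math

# Toy ODE (an assumption for illustration): dx/dt = F(x) + G(x),
# with F(x) = -x and G(x) = x**2. These two vector fields do not commute,
# so splitting them introduces an error that depends on the scheme.

def flow_f(x, h):
    """Exact flow of dx/dt = -x over a time step h."""
    return x * math.exp(-h)

def flow_g(x, h):
    """Exact flow of dx/dt = x**2 over a time step h (valid while x * h < 1)."""
    return x / (1.0 - x * h)

def exact(x0, t):
    """Closed-form solution of dx/dt = -x + x**2 with x(0) = x0."""
    return 1.0 / (1.0 + (1.0 / x0 - 1.0) * math.exp(t))

def lie_trotter(x, h, steps):
    """Lie-Trotter splitting: a full step of F, then a full step of G."""
    for _ in range(steps):
        x = flow_g(flow_f(x, h), h)
    return x

def strang_marchuk(x, h, steps):
    """Strang-Marchuk splitting: half step of F, full step of G, half step of F."""
    for _ in range(steps):
        x = flow_f(x, h / 2)
        x = flow_g(x, h)
        x = flow_f(x, h / 2)
    return x

def global_error(scheme, h, x0=0.5, t_end=1.0):
    """Absolute error at t_end when integrating with step size h."""
    steps = round(t_end / h)
    return abs(scheme(x0, h, steps) - exact(x0, t_end))

if __name__ == "__main__":
    for h in (0.1, 0.05):
        print(f"h={h}: Lie-Trotter error={global_error(lie_trotter, h):.2e}, "
              f"Strang-Marchuk error={global_error(strang_marchuk, h):.2e}")
```

In this toy setting, halving the step size roughly halves the Lie-Trotter error and roughly quarters the Strang-Marchuk error, consistent with first-order and second-order global accuracy respectively. Whether this gap carries over to learned sub-layers is the paper's empirical question and is not established by the sketch.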
