TL;DR: We study the problem of representing phrases as atoms in attention.
Abstract: How should we represent the sentence ``That's the last straw for her''? Self-attention answers with a weighted sum of the individual words, i.e. $$semantics=\alpha_1 Emb(\text{That})+\alpha_2 Emb(\text{'s})+\cdots+\alpha_n Emb(\text{her}).$$ But a weighted sum of ``That's'', ``the'', ``last'', and ``straw'' can hardly represent the semantics of the phrase. We argue that phrases play an important role in attention.
If we instead combine some words into phrases, a more reasonable compositional representation is
$$semantics=\alpha_1 Emb(\text{That's})+\alpha_2 Emb(\text{the last straw})+\alpha_3 Emb(\text{for})+\alpha_4 Emb(\text{her}).$$
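As a rough illustration of the contrast between the two formulas above, here is a minimal sketch (the embedding values, attention logits, and the helper name `weighted_sum` are all hypothetical, not from the paper): the word-level sum treats every token separately, while the phrase-level sum lets ``the last straw'' enter as a single atom.

```python
# Toy sketch: word-level vs. phrase-level weighted sums of embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

tokens = ["That's", "the", "last", "straw", "for", "her"]
emb = {tok: rng.normal(size=dim) for tok in tokens}   # word embeddings (random placeholders)
phrase_emb = rng.normal(size=dim)                     # hypothetical Emb("the last straw")

def weighted_sum(vectors):
    """Softmax-normalized weighted sum, standing in for one attention step."""
    scores = rng.normal(size=len(vectors))            # placeholder attention logits
    alphas = np.exp(scores) / np.exp(scores).sum()
    return sum(a * v for a, v in zip(alphas, vectors))

# Word-level: every token contributes separately.
word_level = weighted_sum([emb[t] for t in tokens])

# Phrase-level: "the last straw" enters as one atom alongside the remaining words.
phrase_level = weighted_sum([emb["That's"], phrase_emb, emb["for"], emb["her"]])
```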
While recent studies favor the attention mechanism for representing natural language, few have paid attention to word compositions. In this paper, we study the problem of representing such compositional structure in attention and propose a new attention architecture called HyperTransformer. Besides representing the words of the sentence, we introduce hypernodes that represent candidate phrases in attention.
HyperTransformer has two phases. The first phase attends over all word/phrase pairs, similarly to the standard Transformer. The second phase models the inductive bias within each phrase. Specifically, we incorporate non-linear attention in the second phase; the non-linearity captures the semantic mutations within phrases, where a phrase's meaning is not a linear combination of its words. Experiments show large improvements: on the WMT16 English-German translation task, BLEU increases from 20.90 (Transformer) to 34.61 (HyperTransformer).
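To make the two-phase description above concrete, the following is a hedged sketch under stated assumptions, not the paper's actual implementation: the module name `TwoPhasePhraseAttention`, the mean-initialized hypernodes, and the tanh-scored aggregation are all illustrative choices for "attend over word/phrase pairs, then apply non-linear attention within each phrase".

```python
# Sketch of a two-phase attention layer with phrase hypernodes (assumed design).
import torch
import torch.nn as nn

class TwoPhasePhraseAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Phase 1: standard self-attention over word nodes and phrase hypernodes jointly.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Phase 2: non-linear scoring inside each phrase (one possible "non-linear attention").
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, words, phrase_spans):
        # words: (batch, n_words, dim); phrase_spans: list of (start, end) index pairs.
        n_words = words.size(1)
        # Initialize each hypernode as the mean of the words it covers (an assumption).
        hyper = torch.stack([words[:, s:e].mean(dim=1) for s, e in phrase_spans], dim=1)
        nodes = torch.cat([words, hyper], dim=1)

        # Phase 1: attend over all word/hypernode pairs, as in a standard Transformer layer.
        nodes, _ = self.global_attn(nodes, nodes, nodes)
        words, hyper = nodes[:, :n_words], nodes[:, n_words:]

        # Phase 2: re-aggregate each phrase with non-linear attention weights, letting the
        # phrase representation deviate from a plain weighted sum of its words.
        updated = []
        for i, (s, e) in enumerate(phrase_spans):
            span = words[:, s:e]                              # (batch, span_len, dim)
            alphas = torch.softmax(self.score(span), dim=1)   # non-linear per-word scores
            updated.append((alphas * span).sum(dim=1) + hyper[:, i])
        return words, torch.stack(updated, dim=1)
```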
Keywords: representation learning, natural language processing, attention