Abstract: This paper proposes echo-attention layers, an efficient method for improving the expressiveness of self-attention layers without incurring significant parameter or training-time costs. The key idea is to iteratively refine the attentional activations via stateful repeated computation, i.e., we compute the activations once and obtain $N$ refinements (echo-attentions) at relatively cheap cost. To this end, we introduce an update and state transition function that operates over these attentional activations. Via a set of extensive experiments, we show that the proposed Echoformer model demonstrates widespread benefits across 21 datasets spanning language modeling, machine translation, language understanding, and question answering.
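As a rough illustration of the idea only (the abstract does not specify the exact update and state transition functions), the sketch below shows one plausible reading: attention activations are computed once, then refined for $N$ "echo" steps by a cheap, gated transition. The module name, gating form, and hyperparameters here are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn


class EchoAttention(nn.Module):
    """Hypothetical sketch of an echo-attention layer.

    The attention activations are computed once, then refined N times
    ("echoes") by a small state transition; the gated update below is an
    assumed form, not the paper's exact method.
    """

    def __init__(self, d_model: int, n_heads: int, n_echoes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical parameters for the transition over activations.
        self.transition = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.n_echoes = n_echoes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base attentional activations, computed a single time.
        state, _ = self.attn(x, x, x)
        # Iteratively refine the activations at low extra cost
        # (no further full attention passes are performed).
        for _ in range(self.n_echoes):
            update = torch.tanh(self.transition(state))
            g = torch.sigmoid(self.gate(torch.cat([state, update], dim=-1)))
            state = g * update + (1.0 - g) * state
        return state


# Example usage: a batch of 2 sequences of length 10 with model width 64.
layer = EchoAttention(d_model=64, n_heads=4, n_echoes=2)
out = layer(torch.randn(2, 10, 64))  # shape: (2, 10, 64)
```

Under this reading, the extra cost per echo is a pair of linear maps rather than a new attention pass, which is consistent with the claim of cheap refinements.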