Attention-Only Transformers and Implementing MLPs with Attention Heads

Published: 07 Nov 2023, Last Modified: 13 Dec 2023 (M3L 2023 Poster)
Keywords: transformer, neural network, architecture, attention
TL;DR: We show that MLP neurons can be implemented by masked, rank-1 attention heads, allowing one to convert an MLP-and-attention transformer into an attention-only transformer.
Abstract: The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1, so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads.
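The SiLU case admits a short numerical illustration of the claim, since SiLU(s) = s · sigmoid(s) and a softmax over two logits [s, 0] puts weight sigmoid(s) on the first entry. Below is a minimal sketch, not the paper's exact construction: it assumes a masking scheme in which each position attends only to itself and a single zero-valued "bias" position, and all names (`w_in`, `w_out`, `rank1_attention_neuron`, etc.) are illustrative.

```python
import torch

torch.manual_seed(0)

d_model = 8
w_in = torch.randn(d_model)   # neuron input weights (one row of the MLP's W_in)
w_out = torch.randn(d_model)  # neuron output weights (one column of the MLP's W_out)

def mlp_neuron(x):
    """One SiLU MLP neuron: SiLU(w_in . x_i) * w_out for each position i."""
    pre = x @ w_in
    return torch.nn.functional.silu(pre)[:, None] * w_out

def rank1_attention_neuron(x):
    """The same neuron as a masked attention head with internal dimension 1.

    Assumed masking: position i attends only to itself and a bias position
    whose key gives logit 0 and whose value is 0. The rank-1 QK circuit
    produces logit s_i = w_in . x_i, so the two-way softmax puts weight
    sigmoid(s_i) on position i. The rank-1 OV circuit reads out value s_i,
    so the head emits sigmoid(s_i) * s_i * w_out = SiLU(s_i) * w_out.
    """
    n = x.shape[0]
    s = x @ w_in                                       # scalar logit per position
    logits = torch.stack([s, torch.zeros(n)], dim=1)   # [self, bias-position]
    weights = torch.softmax(logits, dim=1)             # weights[:, 0] = sigmoid(s)
    head_out = weights[:, 0] * s                       # bias position's value is 0
    return head_out[:, None] * w_out                   # project back to d_model

x = torch.randn(5, d_model)
print(torch.allclose(mlp_neuron(x), rank1_attention_neuron(x), atol=1e-6))  # True
```

Replicating this head once per hidden neuron recovers a full SiLU MLP layer, which is why the conversion to an attention-only transformer multiplies the head count by the MLP width.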
Submission Number: 62