Attention-Only Transformers and Implementing MLPs with Attention Heads

Published: 07 Nov 2023, Last Modified: 13 Dec 2023 (M3L 2023 Poster)
Keywords: transformer, neural network, architecture, attention
TL;DR: We show that MLP neurons can be implemented by masked, rank-1 attention heads, allowing one to convert an MLP-and-attention transformer into an attention-only transformer.
Abstract: The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1, so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads.
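The SiLU case admits a short numerical illustration of the claim, since SiLU(s) = s · sigmoid(s) and a softmax over two logits [s, 0] puts weight sigmoid(s) on the first entry. Below is a minimal sketch, not the paper's exact construction: it assumes a masking scheme in which each position attends only to itself and a single zero-valued "bias" position, and all names (`w_in`, `w_out`, `rank1_attention_neuron`, etc.) are illustrative.

```python
import torch

torch.manual_seed(0)

d_model = 8
w_in = torch.randn(d_model)   # neuron input weights (one row of the MLP's W_in)
w_out = torch.randn(d_model)  # neuron output weights (one column of the MLP's W_out)

def mlp_neuron(x):
    """One SiLU MLP neuron: SiLU(w_in . x_i) * w_out for each position i."""
    pre = x @ w_in
    return torch.nn.functional.silu(pre)[:, None] * w_out

def rank1_attention_neuron(x):
    """The same neuron as a masked attention head with internal dimension 1.

    Assumed masking: position i attends only to itself and a bias position
    whose key gives logit 0 and whose value is 0. The rank-1 QK circuit
    produces logit s_i = w_in . x_i, so the two-way softmax puts weight
    sigmoid(s_i) on position i. The rank-1 OV circuit reads out value s_i,
    so the head emits sigmoid(s_i) * s_i * w_out = SiLU(s_i) * w_out.
    """
    n = x.shape[0]
    s = x @ w_in                                       # scalar logit per position
    logits = torch.stack([s, torch.zeros(n)], dim=1)   # [self, bias-position]
    weights = torch.softmax(logits, dim=1)             # weights[:, 0] = sigmoid(s)
    head_out = weights[:, 0] * s                       # bias position's value is 0
    return head_out[:, None] * w_out                   # project back to d_model

x = torch.randn(5, d_model)
print(torch.allclose(mlp_neuron(x), rank1_attention_neuron(x), atol=1e-6))  # True
```

Replicating this head once per hidden neuron recovers a full SiLU MLP layer, which is why the conversion to an attention-only transformer multiplies the head count by the MLP width.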
Submission Number: 62