Skip Transformers: Efficient Inference through Skip-Routing

Published: 10 Oct 2024, Last Modified: 02 Nov 2024. FITML 2024 Poster. License: CC BY 4.0
Keywords: conditional computation, sparse activation, mixture of experts, transformers
TL;DR: Adding routers to the Transformer architecture that allow token embeddings to skip FFNs can improve inference efficiency while maintaining model performance.
Abstract: As the scale of Transformer-based language models continues to increase, there is a growing need for methodological improvements in training and inference efficiency. Recent developments such as IA3 and LoRA have successfully addressed training efficiency for fine-tuning, but not inference efficiency. Inspired by recent work on Sparse Mixture of Experts and conditional computation in neural networks, we propose Skip Transformers, which modify the standard Transformer architecture by adding a router after each self-attention block that decides, for each token embedding, whether to route it to the corresponding feed-forward neural network (FFN) or to skip the FFN and pass the embedding unchanged to the next attention block. We refer to this process as skip-routing. Using a new set of penalty terms in the loss function and a specific router weight initialization scheme, we demonstrate empirically that adding skip-routing to the Transformer architecture during fine-tuning can improve computational efficiency at inference while maintaining or improving performance on downstream tasks. Although preliminary, these results establish and motivate a promising direction for developing sparsely activated Transformer models that improve both model performance and inference efficiency.
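
The sketch below illustrates the general idea of skip-routing in one Transformer block. It assumes a per-token router implemented as a single linear layer with a sigmoid gate and a hard threshold at inference; the module names, hyperparameters, and threshold are illustrative, and the paper's penalty terms and router initialization scheme are not reproduced here.

```python
# Minimal sketch of a skip-routed Transformer block (illustrative only).
import torch
import torch.nn as nn


class SkipRoutedBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Router placed after self-attention: one logit per token,
        # deciding whether that token's embedding enters the FFN or skips it.
        self.router = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention with a residual connection (post-norm for brevity).
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = self.norm1(x + attn_out)

        # Per-token skip decision; a hard threshold is used at inference,
        # so skipped tokens incur no FFN compute.
        gate = torch.sigmoid(self.router(h))   # (batch, seq, 1)
        use_ffn = gate.squeeze(-1) > 0.5        # (batch, seq) boolean mask

        out = h.clone()                          # skipped tokens pass through unchanged
        if use_ffn.any():
            ffn_out = self.ffn(h[use_ffn])       # FFN only on routed tokens
            out[use_ffn] = self.norm2(h[use_ffn] + ffn_out)
        return out
```

During fine-tuning, the hard threshold would be replaced or relaxed so that the router remains trainable, with the paper's loss penalties encouraging a useful balance between skipped and routed tokens; this sketch only shows the inference-time routing behavior.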
Submission Number: 92