Skip Transformers: Efficient Inference through Skip-Routing

Published: 10 Oct 2024, Last Modified: 02 Nov 2024. FITML 2024 Poster. License: CC BY 4.0
Keywords: conditional computation, sparse activation, mixture of experts, transformers
TL;DR: Adding routers to the Transformer architecture that allow token embeddings to skip FFNs can improve inference efficiency while maintaining model performance.
Abstract: As the scale of Transformer-based language models continues to increase, there is a growing need for methodological improvements in training and inference efficiency. Recent developments such as IA3 and LoRA have successfully addressed training efficiency for fine-tuning, but not inference efficiency. Inspired by recent work on Sparse Mixture of Experts and conditional computation in neural networks, we propose Skip Transformers, which modify the standard Transformer architecture by adding a router after each self-attention block that decides, for each token embedding, whether to route it to the corresponding feed-forward neural network (FFN) or to skip the FFN and pass the embedding unchanged to the next attention block. We refer to this process as skip-routing. Using a new set of penalty terms in the loss function and a specific router weight initialization scheme, we demonstrate empirically that adding skip-routing to the Transformer architecture during fine-tuning can improve computational efficiency at inference while maintaining or improving performance on downstream tasks. Although preliminary, these results establish and motivate a promising direction for developing sparsely activated Transformer models that improve both model performance and inference efficiency.
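
The sketch below illustrates the general idea of skip-routing in one Transformer block. It assumes a per-token router implemented as a single linear layer with a sigmoid gate and a hard threshold at inference; the module names, hyperparameters, and threshold are illustrative, and the paper's penalty terms and router initialization scheme are not reproduced here.

```python
# Minimal sketch of a skip-routed Transformer block (illustrative only).
import torch
import torch.nn as nn


class SkipRoutedBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Router placed after self-attention: one logit per token,
        # deciding whether that token's embedding enters the FFN or skips it.
        self.router = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention with a residual connection (post-norm for brevity).
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = self.norm1(x + attn_out)

        # Per-token skip decision; a hard threshold is used at inference,
        # so skipped tokens incur no FFN compute.
        gate = torch.sigmoid(self.router(h))   # (batch, seq, 1)
        use_ffn = gate.squeeze(-1) > 0.5        # (batch, seq) boolean mask

        out = h.clone()                          # skipped tokens pass through unchanged
        if use_ffn.any():
            ffn_out = self.ffn(h[use_ffn])       # FFN only on routed tokens
            out[use_ffn] = self.norm2(h[use_ffn] + ffn_out)
        return out
```

During fine-tuning, the hard threshold would be replaced or relaxed so that the router remains trainable, with the paper's loss penalties encouraging a useful balance between skipped and routed tokens; this sketch only shows the inference-time routing behavior.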
Submission Number: 92