RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers
Abstract: We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, and that their impact intensifies as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers at test time. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism yields a family of **RePa**rameterizable **Vi**sion **T**ransformers (RePaViTs), which achieve remarkable latency reductions with modest accuracy trade-offs (and sometimes gains) across various ViTs. The benefits of our method scale consistently with model size, delivering greater speed improvements and progressively narrower accuracy gaps, or even higher accuracies, on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy **66.8%** and **68.7%** speed-ups with **+1.7%** and **+1.1%** higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first work to employ structural reparameterization on FFN layers to accelerate ViTs, and we believe it represents a promising direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.
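A minimal PyTorch sketch of the channel idle mechanism described above may help make the idea concrete. The module and hyperparameter names here (`ChannelIdleFFN`, `idle_ratio`) are illustrative assumptions, not the repository's actual API:

```python
import torch
import torch.nn as nn

class ChannelIdleFFN(nn.Module):
    """FFN block in which a fixed fraction of hidden channels stays idle,
    i.e. bypasses the activation function, forming a linear pathway.
    A sketch of the idea only; names and defaults are assumptions."""

    def __init__(self, dim: int, hidden_dim: int, idle_ratio: float = 0.75):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        # Number of hidden channels that skip the nonlinearity.
        self.num_idle = int(hidden_dim * idle_ratio)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc1(x)
        # Idle channels pass through unchanged; the rest go through GELU.
        idle, active = h[..., :self.num_idle], h[..., self.num_idle:]
        return self.fc2(torch.cat([idle, self.act(active)], dim=-1))
```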
Lay Summary: Vision Transformers (ViTs) are powerful AI models used for image recognition and form the backbone of many computer vision tasks. However, as ViTs become larger to achieve higher accuracy, they also become slower and more computationally demanding. This makes them difficult to deploy on resource-limited devices like drones, mobile phones, or embedded systems. As a result, improving the efficiency of existing ViT models has become a key focus in AI research.
While many researchers have blamed the attention mechanism for most of the slowdown in ViTs, our study reveals that the feedforward network (FFN) layers are actually the main bottleneck, especially in large models. To address this, we propose a new method called RePaViT (short for ReParameterizable Vision Transformers), which lets a portion of the feature channels skip the nonlinear activation in each FFN layer. This linear pathway enables structural reparameterization of the FFN layers at inference time, making ViT models smaller and significantly faster.
Experiments show that RePaViT can reduce processing time by nearly 70% for large models, while maintaining or even improving accuracy. This is the first time such a technique has been applied to the FFN layers in ViTs, and we believe it opens up exciting possibilities for building faster, smarter AI systems.
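To illustrate why the idle pathway permits post-training reparameterization, here is a hedged sketch of how the linear branch of the `ChannelIdleFFN` above could be folded into a single dense map at inference time. This is a sketch under the assumptions of that module; the paper's actual merging procedure (e.g. absorbing normalization or residual terms) may differ:

```python
@torch.no_grad()
def reparameterize(ffn: ChannelIdleFFN):
    """Fold the idle (linear) pathway of a trained ChannelIdleFFN into one
    dim->dim linear layer. Illustrative only; the official procedure may
    also merge normalization and residual terms."""
    n = ffn.num_idle
    W1, b1 = ffn.fc1.weight, ffn.fc1.bias  # shapes (hidden, dim), (hidden,)
    W2, b2 = ffn.fc2.weight, ffn.fc2.bias  # shapes (dim, hidden), (dim,)

    # The idle branch computes W2[:, :n] @ (W1[:n] x + b1[:n]) + b2, which
    # is affine in x and therefore collapses into a single linear layer.
    merged = nn.Linear(W1.shape[1], W2.shape[0])
    merged.weight.copy_(W2[:, :n] @ W1[:n])
    merged.bias.copy_(W2[:, :n] @ b1[:n] + b2)

    # The active branch keeps only the remaining (hidden - n) channels;
    # fc2's bias is already absorbed into `merged`.
    slim_fc1 = nn.Linear(W1.shape[1], W1.shape[0] - n)
    slim_fc1.weight.copy_(W1[n:])
    slim_fc1.bias.copy_(b1[n:])
    slim_fc2 = nn.Linear(W1.shape[0] - n, W2.shape[0], bias=False)
    slim_fc2.weight.copy_(W2[:, n:])
    return merged, slim_fc1, slim_fc2

def fast_ffn(x, merged, slim_fc1, slim_fc2):
    # Inference-time FFN: one dense dim->dim map plus a slim nonlinear
    # branch. Equivalent (up to numerics) to the trained FFN's output.
    return merged(x) + slim_fc2(nn.functional.gelu(slim_fc1(x)))
```

Under the assumed `idle_ratio` of 0.75 and the usual 4x hidden expansion, the nonlinear branch shrinks from 4d to d hidden channels, which is where the latency saving in this sketch comes from.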
Link To Code: https://github.com/Ackesnal/RePaViT/
Primary Area: Deep Learning->Attention Mechanisms
Keywords: structural reparameterization, vision transformer, efficiency
Submission Number: 8413