Abstract: Vision Transformers (ViTs) achieve state-of-the-art accuracy on numerous vision tasks, but their heavy computational and memory requirements pose significant challenges. Minimizing token-related computation is critical to alleviating this burden. This paper introduces a novel SuperToken and Early-Pruning (STEP) approach that combines patch merging with an early-pruning mechanism to optimize token handling in ViTs for semantic segmentation. The improved patch-merging method is designed to handle the diverse complexity of images: a dynamic and adaptive scheme, dCTS, employs a CNN-based policy network to determine, at inference time, the number and size of the patch groups that share the same supertoken. Its flexible merging strategy supports superpatches of varying sizes: 2×2, 4×4, 8×8, and 16×16. Early in the network, high-confidence tokens are pruned and excluded from all subsequent processing stages. This hybrid approach reduces the overall computational cost of token processing throughout the network.
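The following is a minimal sketch, not the authors' implementation, of the two mechanisms the abstract describes: merging square groups of patches into a shared supertoken, and pruning high-confidence tokens early so they skip later transformer blocks. The function names, the average-pooling merge, and the confidence threshold `tau` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_supertokens(tokens: torch.Tensor, group: int) -> torch.Tensor:
    """Average-pool a (B, H, W, C) patch grid so every group x group
    block of patches shares one supertoken (assumed merge rule)."""
    x = tokens.permute(0, 3, 1, 2)             # (B, C, H, W)
    x = F.avg_pool2d(x, kernel_size=group)     # one token per block
    return x.permute(0, 2, 3, 1)               # (B, H//g, W//g, C)

def early_prune(tokens: torch.Tensor, logits: torch.Tensor, tau: float = 0.9):
    """Drop tokens whose max class probability exceeds tau; only the
    remaining tokens continue through later stages."""
    conf = logits.softmax(dim=-1).amax(dim=-1)  # (B, N) per-token confidence
    keep = conf < tau                           # mask of tokens kept active
    return tokens[keep], keep

# Toy usage: a 16x16 grid of 64-dim patch tokens, 4x4 superpatches.
grid = torch.randn(1, 16, 16, 64)
supertokens = merge_supertokens(grid, group=4)  # -> (1, 4, 4, 64)
flat = supertokens.flatten(1, 2)                # -> (1, 16, 64)
logits = torch.randn(1, flat.shape[1], 19)      # e.g. 19 segmentation classes
active, keep_mask = early_prune(flat, logits)   # active tokens only
```

In the paper's scheme, the group size would not be a fixed argument as above: the dCTS policy network selects it per region (2×2 up to 16×16), so simple image areas are covered by large superpatches while complex areas keep finer tokens.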