Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision, excelling in tasks such as image classification, segmentation, and object detection. However, their quadratic complexity O(N²), where N is the token sequence length, poses challenges when ViTs are deployed on resource-limited devices. To address this issue, dynamic token merging has emerged as an effective strategy, progressively reducing the token count during inference to achieve computational savings. Some strategies consider all tokens in the sequence as merging candidates, without focusing on spatially close tokens. Other strategies either limit token merging to a local window, or constrain it to pairs of adjacent tokens, thus failing to capture more complex feature relationships. In this paper, we propose Dynamic Hierarchical Token Merging (DHTM), a novel token merging approach, where we advocate that spatially close tokens share more information than distant tokens and consider all pairs of spatially close candidates.