DTA: Deformable Temporal Attention for Video Recognition

Published: 01 Jan 2024 · Last Modified: 15 May 2025 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Recently, transformer models have demonstrated superior performance on video tasks. However, a prevalent limitation of most current video Transformers is their tendency to overlook inherent temporal regions of interest, such as motion trajectories, which makes them susceptible to redundant information during temporal modeling. Existing methods that attend to motion trajectories have high computational demands and are far from lightweight. To strike a balance between effective modeling of temporal regions of interest and computational efficiency, we propose a video transformer backbone with deformable temporal attention (DTA). Inspired by work on deformable receptive fields, DTA employs a lightweight decision network to make temporal attention more flexible. The decision network computes offsets for the tokens in the input feature map, allowing them to shift toward temporally relevant regions of interest and to model temporal information efficiently. We conducted extensive experiments on three popular datasets and surpassed the baseline. We also performed ablation studies on the model structure and parameters. These results confirm the effectiveness of the proposed deformable temporal attention mechanism.
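
A minimal PyTorch sketch of the idea as described in the abstract: a lightweight decision network predicts temporal offsets and mixing weights per query token, and values are sampled at the offset frame positions. The class name, tensor shapes, and the use of linear interpolation between neighboring frames are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DeformableTemporalAttention(nn.Module):
    """Each query frame attends to a few learned, fractionally offset frames (sketch)."""

    def __init__(self, dim, num_heads=4, num_points=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.num_points = num_heads, num_points
        self.head_dim = dim // num_heads
        # Lightweight decision network: from each query token it predicts temporal
        # offsets and mixing weights for every (head, sampling point) pair.
        self.offset_net = nn.Linear(dim, num_heads * num_points)
        self.weight_net = nn.Linear(dim, num_heads * num_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, C) -- one token per frame at a fixed spatial location.
        B, T, C = x.shape
        H, P, D = self.num_heads, self.num_points, self.head_dim

        v = self.value_proj(x).view(B, T, H, D).permute(0, 2, 1, 3)        # (B, H, T, D)
        offsets = self.offset_net(x).view(B, T, H, P).permute(0, 2, 1, 3)  # (B, H, T, P)
        weights = self.weight_net(x).view(B, T, H, P).permute(0, 2, 1, 3).softmax(-1)

        # Sampling positions: each query's own frame index plus its predicted offsets.
        ref = torch.arange(T, device=x.device, dtype=x.dtype).view(1, 1, T, 1)
        pos = (ref + offsets).clamp(0, T - 1)                              # (B, H, T, P)

        # Linearly interpolate values between the two nearest integer frames.
        lo, hi = pos.floor().long(), pos.ceil().long()
        frac = (pos - lo.to(pos.dtype)).unsqueeze(-1)                      # (B, H, T, P, 1)

        def gather_frames(idx):                                            # idx: (B, H, T, P)
            idx = idx.reshape(B, H, T * P, 1).expand(-1, -1, -1, D)
            return torch.gather(v, 2, idx).view(B, H, T, P, D)

        sampled = (1 - frac) * gather_frames(lo) + frac * gather_frames(hi)

        # Weighted sum over the sampled points, then merge heads.
        out = (weights.unsqueeze(-1) * sampled).sum(dim=3)                 # (B, H, T, D)
        out = out.permute(0, 2, 1, 3).reshape(B, T, C)
        return self.out_proj(out)
```

Because each query only aggregates a small, fixed number of sampled frames rather than attending densely over all T frames, the cost grows linearly with the number of sampling points, which is consistent with the lightweight design the abstract emphasizes.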