Abstract: Online action detection aims to determine what is happening in a video stream without waiting for the action to end. Most recent works exploit the strong modeling capability of transformers to capture temporal context. The standard transformer models the global relations among all input tokens, which requires a large amount of computation and a large number of parameters. Although global relations are important for online action detection, the dominant factor should be the local dependencies among consecutive frames. In this paper, we improve the standard transformer for online action detection in three ways. First, we replace the fully connected projection of input tokens in the standard self-attention block with a convolutional layer, which explicitly encodes local motion information among consecutive frames. Since both global and local correlations are modeled, fewer self-attention blocks are needed, which reduces computation and parameters. Second, instead of computing the similarity between every pair of tokens, we compute only the similarities between the current frame and the historical frames, and the distribution of these similarities is guided by a Gaussian function centered at the current frame. In addition, a convolutional layer with a stride of two is inserted between two attention blocks to halve the number of tokens in the next block, further reducing the number of parameters and the computational cost. Third, we design a decoder that retrieves action-category information by learning a query token for each action category; the final decision is made by combining the information from the encoder and the decoder. Experiments on two benchmark datasets, THUMOS'14 and TVSeries, show that the proposed method achieves significant improvements.
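The following is a minimal PyTorch sketch of the encoder block described above, assuming a single attention head, a multiplicative Gaussian prior on the attention weights, and a query taken from the current (most recent) frame only. The class name, kernel size, and `sigma` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ConvAttentionBlock(nn.Module):
    """Illustrative sketch: Q/K/V are produced by 1-D convolutions over time
    (instead of per-token linear projections), attention scores are computed
    only between the current frame and historical frames, and the scores are
    re-weighted by a Gaussian centered at the current frame."""

    def __init__(self, dim, kernel_size=3, sigma=8.0):
        super().__init__()
        pad = kernel_size // 2
        # Convolutional projections mix neighboring frames, encoding local motion.
        self.to_q = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.to_k = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.to_v = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.sigma = sigma
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, time, dim); the last time step is the current frame.
        b, t, d = x.shape
        feats = x.transpose(1, 2)              # (batch, dim, time) for Conv1d
        q = self.to_q(feats)[:, :, -1]         # query from the current frame only
        k = self.to_k(feats)                   # keys for all frames
        v = self.to_v(feats)                   # values for all frames

        # Similarities between the current frame and every historical frame.
        scores = torch.einsum('bd,bdt->bt', q, k) * self.scale

        # Gaussian prior centered at the current (last) frame; hypothetical form.
        dist = torch.arange(t - 1, -1, -1, device=x.device, dtype=x.dtype)
        gauss = torch.exp(-dist ** 2 / (2 * self.sigma ** 2))
        weights = torch.softmax(scores, dim=-1) * gauss
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Aggregate values into a single context vector for the current frame.
        out = torch.einsum('bt,bdt->bd', weights, v)
        return out                             # (batch, dim)


# Between two attention blocks, the token sequence could be halved with a
# stride-2 convolution, e.g.:
# downsample = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
```

Under these assumptions, the query comes from a single frame, so the attention map is a vector of length `t` rather than a `t x t` matrix, which is where the reduction in computation and parameters comes from.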