LSN: Long-Term Spatio-Temporal Network for Video Recognition

Published: 01 Jan 2022, Last Modified: 19 Oct 2024ICPCSEE (1) 2022EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Although recurrent neural networks (RNNs) are widely leveraged to process temporal or sequential data, they have attracted too little attention in current video action recognition applications. Therefore, this work attempts to model the long-term spatio-temporal information of the video based on a variant of RNN, i.e., higher-order RNN. Moreover, we propose a novel long-term spatio-temporal network (LSN) for solving this video task, the core of which integrates the newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with traditional 2D convolutional blocks. Specifically, each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module as well as a standard ConvLSTM module, and several previous hidden states in the ATS module are accumulated to one temporary state that will enter the standard ConvLSTM to determine the output together with the current input. The HO-ConvLSTM module can be inserted into different stages of the 2D convolutional neural network (CNN) in a plug-and-play manner, thus well characterizing the long-term temporal evolution at various spatial resolutions. Experiment results on three commonly used video benchmarks demonstrate that the proposed LSN model can achieve competitive performance with the representative models.
Loading