Abstract: Video classification is an important and challenging task. A video usually contains a series of key actions and motion patterns, which a classifier must learn and describe with an embedding vector. These actions and patterns generally carry different semantic information. However, existing methods usually represent the entire video with single-level semantic features, such as the output of the last pooling layer. As a result, complex video content is not represented effectively and classification accuracy suffers. To address this limitation, we propose a novel multi-semantic representation method for video classification. Our method consists of several transformer network blocks, semantic graph attention modules, and a feature fusion module. Each transformer block extracts visual features from the video frames, and the features of the last block are transformed into an embedding vector. The blocks capture different levels of visual features. The graph attention modules use these features to generate multi-semantic vectors for the video. Finally, the feature fusion module fuses the multi-semantic vectors with the embedding vector, and the fused vector is used to classify the video. Extensive experiments on a benchmark video classification dataset demonstrate that our method outperforms various state-of-the-art methods.
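The data flow described in the abstract (stacked blocks, one semantic vector per level, then fusion with the final embedding) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration and not the authors' implementation: `toy_block` is a parameter-free self-attention layer standing in for a trained transformer block, `attention_pool` is a simple attention pooling over a fully connected frame graph standing in for the semantic graph attention module, and fusion is assumed to be plain concatenation. The function names, the number of blocks, and the toy frame features are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def mean(vectors):
    return weighted_sum([1.0 / len(vectors)] * len(vectors), vectors)

def toy_block(frames):
    # ASSUMPTION: stand-in for one transformer block. Each frame attends to
    # all frames (scaled dot-product, no learned weights) plus a residual.
    scale = math.sqrt(len(frames[0]))
    out = []
    for q in frames:
        attn = softmax([dot(q, k) / scale for k in frames])
        mixed = weighted_sum(attn, frames)
        out.append([a + b for a, b in zip(q, mixed)])
    return out

def attention_pool(nodes):
    # ASSUMPTION: stand-in for a graph attention module. Frames are nodes of
    # a fully connected graph; each node is scored against the mean node and
    # the nodes are aggregated into one semantic vector.
    query = mean(nodes)
    weights = softmax([dot(query, n) for n in nodes])
    return weighted_sum(weights, nodes)

def represent_video(frames, num_blocks=3):
    levels = []
    x = frames
    for _ in range(num_blocks):
        x = toy_block(x)
        levels.append(x)  # keep the features of every level
    embedding = mean(levels[-1])  # embedding from the last block only
    semantic_vectors = [attention_pool(level) for level in levels]
    # ASSUMPTION: fusion by concatenation of all semantic vectors and the embedding.
    return [v for vec in semantic_vectors for v in vec] + embedding

# Three hypothetical frames with 4-dimensional features.
frames = [[0.1, 0.2, 0.0, 0.3], [0.0, 0.1, 0.4, 0.1], [0.2, 0.0, 0.1, 0.2]]
fused = represent_video(frames)
print(len(fused))  # 3 levels * 4 dims + 4 dims = 16
```

In a real system the fused vector would be passed to a trained classification head. The sketch only shows how features from every block, and not just the last one, contribute to the final representation.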