Encoding scale into Fisher vector for human action recognition

Bowen Zhang, Hanli Wang

Published: 2015, Last Modified: 13 Nov 2024, VCIP 2015, CC BY-SA 4.0

Abstract: In this paper, a new kind of Fisher Vector (FV) model, named Scale FV (ScaleFV), is proposed to improve visual feature encoding for human action recognition. Although many feature encoding methods have been proposed, temporal scale information has been largely ignored. Spatial scale information has been shown to be important for extracting and encoding visual features, and our investigation indicates that temporal scale information likewise plays an important role in video content analysis. To demonstrate this, temporal scale in videos is defined, and it is shown that both spatial and temporal scale information can be encoded into the FV model by slightly modifying the underlying Gaussian Mixture Model (GMM). Furthermore, an enhanced FV model, termed Combined FV (CombFV), is designed to capture both position and scale information for human action recognition. Comparative experiments are carried out to demonstrate the superior performance of the proposed methods.
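The abstract gives no formulas, so for reference here is a minimal NumPy sketch of standard (improved) Fisher Vector encoding under a diagonal GMM, preceded by a hypothetical `augment` step that appends per-descriptor scale values as extra descriptor dimensions. This is an assumption made for illustration only: the paper's actual modification of the GMM, its definition of temporal scale, and the log transform applied to scale here are not taken from the source and may differ. The GMM parameters are toy values, not a fitted model.

```python
import numpy as np


def fisher_vector(x, weights, means, variances):
    """Improved Fisher Vector of descriptors x (N x D) under a diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D) diagonal covariances.
    Returns a 2*K*D vector of gradients w.r.t. means and standard deviations,
    after signed square-root (power) and L2 normalisation.
    """
    x = np.atleast_2d(x)
    n = x.shape[0]
    # Whitened differences to every component: shape (N, K, D)
    diff = (x[:, None, :] - means[None, :, :]) / np.sqrt(variances)[None, :, :]
    # Log of w_k * N(x | mu_k, Sigma_k) for each descriptor and component: (N, K)
    log_p = -0.5 * (np.sum(diff ** 2, axis=2)
                    + np.sum(np.log(2 * np.pi * variances), axis=1)[None, :])
    log_p += np.log(weights)[None, :]
    # Posteriors (soft assignments), normalised stably by subtracting the row max
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradient statistics w.r.t. means and standard deviations: (K, D) each
    g_mu = np.einsum("nk,nkd->kd", gamma, diff) / (n * np.sqrt(weights)[:, None])
    g_sigma = np.einsum("nk,nkd->kd", gamma, diff ** 2 - 1) / (
        n * np.sqrt(2 * weights)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power normalisation followed by L2 normalisation
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv


def augment(descriptors, extra):
    """Hypothetical helper: append per-descriptor side information
    (e.g. spatial and temporal scale) as extra descriptor dimensions."""
    return np.hstack([descriptors, extra])


rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 4))             # 100 toy 4-D local descriptors
scales = rng.uniform(1.0, 3.0, size=(100, 2))  # hypothetical (spatial, temporal) scales
x = augment(desc, np.log(scales))            # log-scale is an assumption
k, d = 3, x.shape[1]
weights = np.full(k, 1.0 / k)                # toy GMM, not fitted
means = rng.normal(size=(k, d))
variances = np.ones((k, d))
fv = fisher_vector(x, weights, means, variances)
print(fv.shape)  # (36,) = 2 * K * D with K = 3, D = 6
```

Appending scale to each descriptor lets the diagonal GMM model it jointly with appearance, so the gradient statistics carry scale information at the cost of a slightly longer vector (2K extra entries per added dimension).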