Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level ``acton'' representation. The actons are learned via a new max-margin multi-channel multiple instance learning framework. The learned actons (with no requirement for detailed manual annotations) thus observe a property of being compact, informative, discriminative, and easy to scale. This is different from the standard unsupervised (e.g. k-means) or supervised (e.g. random forests) coding strategies in action recognition. Applying the learned actons in our two-layer structure yields the state-of-the-art classification performance on Youtube and HMDB51 datasets.
0 Replies
Loading