Abstract: Action recognition is a video understanding task that aims to recognize the action performed by an object in a video. Recognizing the action requires extracting motion information through temporal modeling. However, videos typically contain high temporal redundancy, such as repeated events and near-duplicate adjacent frames. This redundancy weakens the information related to the actual action, making it difficult for the final classifier to recognize the action. In this article, we focus on preserving the information helpful for action recognition by reducing the high temporal redundancy in videos. To achieve this goal, we propose a novel frame selection method called cluster-guided frame selection (CluFrame). Specifically, in the temporal compression (TC) module, CluFrame compresses an input video into the keyframes of clusters discovered by applying \(k\)-means clustering to frame-wise features extracted from pre-trained 2D-CNNs. In addition, CluFrame selects keyframes related to the action of the input video by optimizing the TC module based on the action recognition results. Experimental results on five benchmark datasets demonstrate that CluFrame mitigates the high temporal redundancy in videos and improves action recognition accuracy over existing action recognition methods by up to 6.6%, and over state-of-the-art frame selection methods by about 0.7%.
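The clustering-based compression step described in the abstract can be illustrated with a minimal sketch: apply \(k\)-means to frame-wise feature vectors and keep, for each cluster, the frame closest to the cluster centroid as a keyframe. This is an illustrative reconstruction, not the paper's implementation; the function name `select_keyframes` and all hyperparameters are assumptions, and the learned, recognition-driven optimization of the TC module is omitted.

```python
import numpy as np

def select_keyframes(frame_feats: np.ndarray, k: int, n_iters: int = 20, seed: int = 0):
    """Compress a video into keyframe indices via k-means on frame features.

    frame_feats: (T, D) array of per-frame features (e.g., from a 2D-CNN).
    Returns a sorted list of at most k frame indices (one per non-empty cluster).
    Illustrative sketch only; hyperparameters are assumed, not from the paper.
    """
    rng = np.random.default_rng(seed)
    T = frame_feats.shape[0]
    # Initialize centroids with k distinct frames.
    centroids = frame_feats[rng.choice(T, size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assign each frame to its nearest centroid.
        dists = np.linalg.norm(frame_feats[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned frames.
        for c in range(k):
            members = frame_feats[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Keyframe of each cluster = frame nearest to that cluster's centroid.
    dists = np.linalg.norm(frame_feats[:, None] - centroids[None], axis=-1)
    return sorted(set(dists.argmin(axis=0)))
```

In practice the selected keyframe indices would be used to subsample the video before it is passed to the action classifier, reducing redundant frames while retaining one representative frame per discovered cluster.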