Semi-Supervised Learning and Capsule Network for Video Action Detection

Van-Khoa Duong, Ngoc-Ha Pham-Thi, Ngoc-Thao Nguyen

Published: 2023, Last Modified: 14 May 2025. Venue: RIVF 2023. License: CC BY-SA 4.0

Abstract: Recent advances in semi-supervised learning for video action detection have shown great potential. These approaches can effectively exploit the vast amount of unlabeled data available, whereas labeled data is usually limited and expensive to obtain. This paper introduces a novel architecture for video action detection that makes several modifications to a baseline end-to-end model. We pay closer attention to feature construction inside the architecture and enrich the data with a robust augmentation technique. Furthermore, instead of hiding both the action and localization labels of the unlabeled dataset, we leverage the classification labels to improve localization accuracy. Experiments on the UCF101-24 benchmark (24 classes), using only 20% of the training annotations, show the advantage of the proposed model in this video understanding task. The new approach outperformed the baseline model by 3.1% in f-mAP@0.5 and 4.4% in v-mAP@0.5, and reached performance competitive with that of supervised methods.
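For readers unfamiliar with the "@0.5" suffix in the reported metrics: it conventionally denotes an intersection-over-union (IoU) threshold of 0.5 used to decide whether a detection matches a ground-truth annotation. The sketch below illustrates that matching rule for a single frame-level box, as used in frame-mAP. It is a minimal illustration under the common convention, not the authors' evaluation code; the function names and the (x1, y1, x2, y2) box format are assumptions made for this example. Video-mAP applies the same idea to spatio-temporal tubes rather than single boxes, which is not shown here.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2). Box format is an assumption."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, iou_threshold=0.5):
    """Hypothetical helper: a detection counts as correct when the class
    matches and the box IoU reaches the threshold (0.5 for "@0.5")."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= iou_threshold
```

Average precision is then computed per class from detections ranked by confidence, and the mean over the 24 classes gives the mAP figure.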