Clustering-based multi-featured self-supervised learning for human activities and video retrieval

Published: 2024, Last Modified: 21 Jan 2026. Appl. Intell. 2024. License: CC BY-SA 4.0
Abstract: Human-centric content-based video retrieval has emerged as a prominent area of research due to its diverse applications. However, the task presents several inherent challenges, including end-to-end image classification and data sampling. Although self-supervised learning methods have made significant progress on these challenges, several issues remain. One major concern is the generation of randomly sampled inverse-complementary pairs, which requires careful handling to avoid false positives. Moreover, the common assumption that the similarity between video clips is solely temporal neglects the role of other factors, such as motion. To address these issues, this paper proposes a clustering-based multi-featured self-supervised learning model called CMS2L. The model introduces a fundamental improvement by fixing intra-class positive sampling to avoid false labeling during stage training due to looping clusters. Additionally, it employs a second stream with an expanded range of features to achieve a more comprehensive representation of actions. Experimental results on benchmark datasets demonstrate the superiority of the proposed model.
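The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of drawing positive pairs from within a cluster instead of at random. It is not the CMS2L procedure. The function name `sample_positive_pairs`, the toy `cluster_ids` list, and the assumption that each clip already carries a cluster assignment (for example from k-means on clip embeddings) are all hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def sample_positive_pairs(cluster_ids, num_pairs, seed=0):
    """Draw (anchor, positive) clip-index pairs that share a cluster id.

    Hypothetical helper: cluster_ids[i] is the assumed cluster assignment
    of clip i. This illustrates cluster-guided positive sampling in
    general, not the method of the paper.
    """
    rng = random.Random(seed)

    # Group clip indices by their cluster assignment.
    members = defaultdict(list)
    for clip_index, cluster in enumerate(cluster_ids):
        members[cluster].append(clip_index)

    # Only clusters with at least two clips can supply a distinct positive.
    eligible = [idxs for idxs in members.values() if len(idxs) >= 2]
    if not eligible:
        raise ValueError("no cluster contains two or more clips")

    pairs = []
    for _ in range(num_pairs):
        # Pick a cluster, then two different clips from it.
        anchor, positive = rng.sample(rng.choice(eligible), 2)
        pairs.append((anchor, positive))
    return pairs


# Toy assignments: clips 0-1 in cluster 0, clips 2-4 in cluster 1,
# clip 5 alone in cluster 2 (so it can never form a pair).
cluster_ids = [0, 0, 1, 1, 1, 2]
pairs = sample_positive_pairs(cluster_ids, num_pairs=4)
```

Restricting positives to a shared cluster is one plausible way to reduce the false positives that purely random pairing can produce, provided the cluster assignments themselves are reliable.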