Semi-Automatic Labeling for Action Recognition by Diversity Preserving Sampling

Published: 2025 · Last Modified: 23 Dec 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Deep learning for action recognition is an important technology for understanding videos. However, collecting a video training dataset for deep learning models at low cost while maintaining sufficient diversity is challenging. In this paper, we propose a semi-automatic labeling framework for action recognition based on diversity-preserving sampling. The proposed framework uses a pre-trained vision-language model (VLM) to search through video clips and filter those that match a text description of the appropriate context for the target action. Since this simple VLM-based filtering tends to lack diversity, our framework is also equipped with diversity-preserving sampling, which consists of two strategies: confidence-based weighted sampling, which uses the action class confidence obtained from the VLM, and isolate-constraint-based weighted sampling, which selects points that are far apart in the text-image feature space. Experiments demonstrate that the proposed approach efficiently collects varied data and trains a better action recognition model than the baseline.
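The abstract describes two sampling strategies over VLM outputs: one weighted by action class confidence, and one that keeps selected points far apart in the text-image feature space. The sketch below illustrates one plausible realization of these two ideas; it is not the paper's implementation. The similarity threshold, the sample sizes, the use of greedy farthest-point selection as the "isolate constraint", and the way the two subsets are combined are all assumptions, and the random features stand in for real VLM embeddings and confidences.

```python
import numpy as np


def confidence_weighted_sample(confidences, k, rng):
    """Sample k indices with probability proportional to VLM confidence scores."""
    probs = confidences / confidences.sum()
    return rng.choice(len(confidences), size=k, replace=False, p=probs)


def farthest_point_sample(features, k, rng):
    """Greedily pick k points that are mutually far apart in feature space
    (one simple way to impose an isolation-style constraint; an assumption here)."""
    n = len(features)
    selected = [int(rng.integers(n))]
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[idx], axis=1))
    return np.array(selected)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Placeholder stand-ins for VLM outputs over N candidate clips
    # (in practice these would come from a CLIP-style text-image model).
    n_clips, dim = 1000, 512
    clip_features = rng.normal(size=(n_clips, dim))
    clip_features /= np.linalg.norm(clip_features, axis=1, keepdims=True)
    confidences = rng.uniform(0.0, 1.0, size=n_clips)

    # Step 1: keep clips whose confidence for the target action text exceeds
    # a threshold (0.5 is arbitrary, chosen for illustration).
    kept = np.where(confidences > 0.5)[0]

    # Step 2: apply the two sampling strategies on the filtered pool and
    # merge the results (the merge rule is an assumption).
    by_confidence = kept[confidence_weighted_sample(confidences[kept], 50, rng)]
    by_diversity = kept[farthest_point_sample(clip_features[kept], 50, rng)]
    selected = np.union1d(by_confidence, by_diversity)
    print(f"Selected {len(selected)} clips for human verification and labeling.")
```

In this reading, the confidence-weighted draw favors clips the VLM is sure about, while the farthest-point pass pulls in clips that sit in sparsely covered regions of the feature space, so the final pool sent for labeling trades off precision against coverage.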