Abstract: In object detection, the remarkable strides made by deep neural networks over the past decade have been hampered by challenges such as data labeling and the need to capture natural variations in training samples. Existing benchmark datasets are confined to a limited set of classes and natural variations. This paper presents "SA-LfV", a novel framework designed to streamline object detection from videos with minimal human input. By leveraging basic computer vision tasks, namely image classification and single-object tracking, our method efficiently generates pseudo-labels for object detection. To ensure a rich variety of training samples, we introduce two sampling strategies. The first applies density-based clustering to choose samples that represent a wide range of scenarios. The second analyzes object movements and their mutual information, capturing diverse behaviors and appearances. The proposed labeling procedure is demonstrated on object-tracking datasets and custom-downloaded videos. Through these methods, our framework has produced a dataset with 70,000 pseudo-labeled bounding boxes across 13 object classes, significantly diversifying the data available for object detection. Our experiments show that the proposed framework adapts effectively to unlabelled ImageNet classes, indicating its potential to broaden the capabilities of object detection models. Moreover, integrating our self-annotated dataset with standard benchmark datasets leads to a notable improvement in object detection performance. This approach not only simplifies the traditionally labor-intensive process of manual labeling but also paves the way for extending object detection to a wider range of classes and applications.
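The abstract does not specify which density-based clustering algorithm drives the first sampling strategy, so the sketch below is only an illustration of the general idea: cluster frame feature vectors with a DBSCAN-style procedure, then keep one representative per cluster plus all noise points (rare scenarios). All function names and parameters here are hypothetical, not taken from the paper.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster label per point (-1 = noise)."""
    labels = [None] * len(points)  # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # tentatively noise; may be absorbed later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # border point: attach to cluster
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(nbrs) >= min_pts:     # core point: expand the cluster
                queue.extend(k for k in nbrs if labels[k] is None)
    return labels

def diverse_samples(features, eps=1.0, min_pts=2):
    """Pick one representative index per cluster, plus every noise point."""
    labels = dbscan(features, eps, min_pts)
    reps = {}
    for idx, lab in enumerate(labels):
        if lab >= 0 and lab not in reps:
            reps[lab] = idx
    noise = [i for i, lab in enumerate(labels) if lab == -1]
    return sorted(reps.values()) + noise

# Two dense groups of near-duplicate frames plus one outlier scenario:
feats = [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (10.5, 10), (10, 10.5), (50, 50)]
print(diverse_samples(feats))  # one index per group, plus the outlier
```

In practice the feature vectors would come from an image classifier's embedding space rather than raw 2-D coordinates; the point is that redundant near-duplicate frames collapse to a single training sample while unusual frames are always kept.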