Are all Frames Equal? Active Sparse Labeling for Video Action Detection

Published: 31 Oct 2022, 18:00, Last Modified: 11 Oct 2022, 23:42
NeurIPS 2022 Accept
Keywords: active sparse labeling, video action detection
TL;DR: Efficient labeling of videos using active learning to minimize labeling cost for action detection and other video-related tasks.
Abstract: Video action detection requires annotations at every frame, which drastically increases the labeling cost. In this work, we focus on efficient labeling of videos for action detection to minimize this cost. We propose active sparse labeling (ASL), a novel active learning strategy for video action detection. Sparse labeling reduces the annotation cost but poses two main challenges: 1) how to estimate the utility of annotating a single frame for action detection, since detection is performed at the video level, and 2) how these sparse labels can be used for action detection, which requires annotations on all frames. This work attempts to address these challenges within a simple active learning framework. For the first challenge, we propose a novel frame-level scoring mechanism aimed at selecting the most informative frames in a video. Next, we introduce a novel loss formulation which enables training of an action detection model with these sparsely selected frames. We evaluate the proposed approach on two different action detection benchmark datasets, UCF-101-24 and J-HMDB-21, and observe that active sparse labeling can be very effective in saving annotation costs. We demonstrate that the proposed approach performs better than random selection, outperforming all other baselines, with performance comparable to the supervised approach using merely 10% annotations.
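To make the first challenge concrete, the selection step can be illustrated with a generic uncertainty-based sketch: score each frame by the entropy of the model's current per-frame class predictions and send the highest-scoring frames for annotation. Note this is an assumed, simplified stand-in for illustration only, not the paper's actual frame-level scoring mechanism; the function names (`frame_entropy`, `select_frames`) and the fixed per-video `budget` are hypothetical.

```python
import math

def frame_entropy(probs):
    """Shannon entropy of one frame's class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_frames(per_frame_probs, budget):
    """Rank frames by predictive entropy and return the `budget` most uncertain
    frame indices (sorted), to be sent for manual annotation.

    per_frame_probs: list of per-frame class-probability lists from the current model.
    """
    scored = sorted(
        ((frame_entropy(p), i) for i, p in enumerate(per_frame_probs)),
        reverse=True,
    )
    return sorted(i for _, i in scored[:budget])

# Toy video: 5 frames, 3 action classes; frame 2 is nearly uniform (most uncertain).
probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10],
    [0.95, 0.03, 0.02],
]
print(select_frames(probs, budget=2))  # -> [2, 3]
```

In an active learning loop, this selection would alternate with retraining: annotate the chosen frames, update the model on the enlarged sparse label set, and re-score the remaining unlabeled frames.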
Supplementary Material: pdf