Reframing Dense Action Detection (RefDense): A New Perspective on Problem Solving and a Novel Optimization Strategy
Keywords: Dense Action Detection, Video Understanding
Abstract: In dense action detection, we aim to detect multiple co-occurring actions. However, action classes are often ambiguous, as they share overlapping sub-components. We argue that the dual challenges of temporal overlap and class overlap are too complex to be effectively addressed as a single problem by a unified network. To overcome this, we propose decomposing the task into sub-problems of detecting temporally dense but unambiguous sub-components underlying the action classes, and assigning these sub-problems to distinct sub-networks. By isolating unambiguous concepts, each sub-network focuses solely on resolving dense temporal overlaps, thereby simplifying the overall problem. Furthermore, co-occurring actions in a video often exhibit interrelationships, and exploiting these relationships can improve detection performance. However, current dense action detection networks fail to learn these relationships effectively because they rely on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, achieving substantial improvements across different metrics on three challenging benchmark datasets: TSU, Charades, and MultiTHUMOS. Our code will be released upon paper publication.
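To make the contrast with binary cross-entropy concrete, the sketch below shows how a language-guided contrastive objective of the kind described in the abstract might look. This is a hypothetical illustration, not the paper's actual loss: the function name, the use of class-name text embeddings, and the multi-positive softmax formulation are all assumptions; the key point is that all co-occurring classes at a timestep act as joint positives against the remaining classes, so the loss couples classes rather than treating each one independently as BCE does.

```python
import torch
import torch.nn.functional as F

def language_guided_contrastive_loss(video_feats, text_embeds, labels,
                                     temperature=0.07):
    """Hypothetical multi-positive contrastive loss (illustration only).

    video_feats: (T, D) per-timestep video features
    text_embeds: (C, D) class-name embeddings from a language model
    labels:      (T, C) multi-hot ground truth of co-occurring actions
    """
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    # Similarity of every timestep to every class-name embedding.
    logits = v @ t.T / temperature              # (T, C)
    log_prob = F.log_softmax(logits, dim=-1)
    pos_mask = labels.float()
    # Average log-likelihood over the co-occurring (positive) classes,
    # so all classes active at a timestep are pulled together jointly.
    denom = pos_mask.sum(dim=-1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=-1) / denom
    return loss.mean()
```

Unlike per-class BCE, the softmax over classes means that raising the score of one co-occurring class directly affects the others, giving the explicit co-occurrence supervision the abstract argues for.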
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1616