From Video Classification to Action Detection: Foundation vs. Task-Specific Models

Published: 09 Jun 2025, Last Modified: 09 Jun 2025 · FMSD @ ICML 2025 · CC BY 4.0
Keywords: Video-level classification; Frame-level classification; Action Detection; Action Localization; Saliency Maps
TL;DR: Most skeleton datasets lack frame-level labels, limiting action detection. We generate pseudo-labels from video-level annotations via saliency maps, enabling fine-grained motion analysis.
Abstract: Real-time action detection demands fine-grained supervision, yet most skeleton-based datasets provide only video-level annotations, owing to the high cost, subjectivity, and time-consuming nature of frame-level labeling. To bridge this gap, we propose a pipeline that transforms video-level annotations into frame-level pseudo-labels via saliency maps. This approach significantly reduces the need for manual labeling while enabling frame-level action detection. We evaluate our method using both structured foundation models and task-specific architectures for action recognition (daily activities and rehabilitation) across four diverse datasets: SERE, Toronto Rehab, UTK, and MMAct. Our results highlight the cross-user generalization potential of foundation models trained on structured time-series data, offering an efficient route from video-level labels to fine-grained motion analysis.
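The abstract does not specify how saliency is turned into pseudo-labels; as a rough illustration of the general idea, the sketch below assumes a PyTorch classifier over skeleton sequences and derives per-frame saliency from input gradients of the video-level class score. The model, the gradient aggregation over joints and coordinates, and the threshold are all assumptions for illustration, not the paper's exact method.

```python
# Minimal sketch: saliency-based frame-level pseudo-labeling from a
# video-level annotation. Assumes a PyTorch skeleton classifier taking
# sequences of shape (T, J, C): T frames, J joints, C coordinates.
import torch

def frame_pseudo_labels(model, sequence, video_label, threshold=0.5):
    """Convert one video-level label into frame-level pseudo-labels.

    sequence: float tensor of shape (T, J, C)
    video_label: int class index annotated for the whole video
    Returns a (T,) long tensor: video_label on salient frames,
    -1 (background / ignore) elsewhere.
    """
    x = sequence.clone().requires_grad_(True)       # track gradients w.r.t. input
    logits = model(x.unsqueeze(0))                  # (1, num_classes)
    logits[0, video_label].backward()               # d(class score) / d(input)

    saliency = x.grad.abs().sum(dim=(1, 2))         # per-frame saliency, shape (T,)
    saliency = saliency / (saliency.max() + 1e-8)   # normalize to [0, 1]

    labels = torch.full((x.shape[0],), -1, dtype=torch.long)
    labels[saliency >= threshold] = video_label     # salient frames inherit the label
    return labels
```

The resulting frame-level pseudo-labels can then stand in for manual annotations when training or evaluating a frame-level action detector.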
Submission Number: 56