Keywords: Weakly-Supervised Learning, Long-Term Action Anticipation, Language-Based Supervision, Video Understanding
TL;DR: Long-Term Action Anticipation via Transcript-based Supervision
Abstract: Long-Term Action Anticipation (LTA) from video is a crucial task in computer vision, with significant implications for human-machine interaction, robotics, and beyond. To date, however, it has been tackled exclusively in a fully supervised manner, relying on dense frame-level annotations that hinder scalability and limit real-world applicability. To address this limitation, we introduce TbLTA (Transcript-based LTA), the first weakly-supervised approach to LTA, which relies solely on video transcripts during training. This high-level semantic supervision provides the narrative temporal structure that guides the model toward understanding the relationships between events over time. Our model is built on an encoder-decoder architecture trained with dense pseudo-labels generated by a temporal alignment module, which supervise the predictions of both the segmentation head and the anticipation decoder. In addition, the video transcript itself is used (1) to enhance video features by contextually grounding them through cross-modal attention, and (2) to supply more global supervision over the model's action segmentation predictions across the full video, which in turn provides a better-contextualized representation to the anticipation decoder. Through experiments on the Breakfast, 50Salads, and EGTEA benchmarks, we demonstrate that transcript-based supervision offers a robust and less costly alternative to its fully supervised counterpart for the LTA task.
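The cross-modal grounding step described in the abstract can be pictured as standard cross-attention in which per-frame video features act as queries over transcript embeddings. Below is a minimal PyTorch sketch of that idea; the module name `TranscriptGrounding`, the dimensions, and the residual design are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class TranscriptGrounding(nn.Module):
    """Minimal sketch: ground per-frame video features in the transcript
    via cross-modal attention. Names and dimensions are assumptions for
    illustration, not the paper's implementation."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frame_feats: torch.Tensor, transcript_emb: torch.Tensor) -> torch.Tensor:
        # frame_feats:    (B, T, d_model) per-frame visual features
        # transcript_emb: (B, M, d_model) embeddings of the ordered action labels
        attended, _ = self.cross_attn(query=frame_feats,
                                      key=transcript_emb,
                                      value=transcript_emb)
        # Residual connection keeps the original visual signal while
        # injecting transcript context into each frame's representation.
        return self.norm(frame_feats + attended)

# Example usage: batch of 2 videos, 100 frames, 6 transcript steps.
grounding = TranscriptGrounding()
video = torch.randn(2, 100, 512)
transcript = torch.randn(2, 6, 512)
grounded = grounding(video, transcript)  # (2, 100, 512)
```

The residual-plus-normalization pattern is one plausible way to contextualize frame features without discarding the visual evidence they carry; the paper's actual module may differ.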
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19797