Keywords: Vision and Learning, object state change, human action, anticipation, procedural videos
TL;DR: The proposed methods tackle the prediction of human actions and object state changes in goal-oriented videos (e.g., assembly) by fusing visual, semantic, and linguistic cues. A new benchmark for object state change anticipation is also introduced.
Abstract: Understanding and anticipating how human activities unfold over time is a core challenge in video understanding, with broad applications in robotics, assistive technologies, and scene interpretation. Real-world activities, such as cooking, assembly tasks, or object manipulation, are inherently goal-oriented processes consisting of structured sequences of actions that lead to associated changes in objects’ states. In this context, two key challenges emerge: action anticipation, i.e., predicting the future procedural steps of an activity, and object state change anticipation, i.e., forecasting the expected transformations of objects as a result of near-future, yet unseen, human actions. Addressing these problems requires models capable of integrating the immediate visual context of an activity with its long-term semantic context, which relates to the history of actions and interactions. The proposed methods address these challenges. Specifically, we combine visual information with semantic and linguistic cues to effectively capture both short-term and long-range temporal dependencies between actions, enabling models to achieve improved action anticipation. Additionally, we introduce a new benchmark and methodology targeting the task of object state change anticipation by leveraging curated annotations and multimodal inputs to connect prior object interactions with probable future transformations. Based on these contributions, we aspire to advance anticipatory video understanding by enabling AI systems to model and recognize human intention, the procedural structure of human activities, and the induced functional object transformations.
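To make the idea of combining short-term visual context with a linguistic encoding of the action history more concrete, the following minimal PyTorch-style sketch shows one common way such a fusion could be wired for next-action anticipation. This is an illustrative assumption only, not the published VLMAH architecture: all module names, dimensions, and the concatenation-based fusion scheme are hypothetical.

```python
# Hypothetical sketch of visual-linguistic fusion for next-action anticipation.
# All names, dimensions, and design choices are illustrative assumptions,
# not the published VLMAH implementation.
import torch
import torch.nn as nn


class AnticipationFusion(nn.Module):
    def __init__(self, visual_dim=1024, history_dim=768, hidden_dim=512, num_actions=100):
        super().__init__()
        # Temporal encoder for recently observed visual features (short-term context).
        self.visual_rnn = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # Projection of a language embedding of the action history (long-term context).
        self.history_proj = nn.Linear(history_dim, hidden_dim)
        # Classifier over the next (not yet observed) action.
        self.classifier = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, visual_feats, history_embedding):
        # visual_feats: (batch, time, visual_dim) frame/clip features of the observed segment.
        # history_embedding: (batch, history_dim) embedding of the past action labels as text.
        _, h_n = self.visual_rnn(visual_feats)        # h_n: (1, batch, hidden_dim)
        visual_ctx = h_n[-1]                          # (batch, hidden_dim)
        history_ctx = torch.relu(self.history_proj(history_embedding))
        fused = torch.cat([visual_ctx, history_ctx], dim=-1)
        return self.classifier(fused)                 # logits over candidate next actions


# Toy usage with random tensors standing in for real video and text features.
model = AnticipationFusion()
logits = model(torch.randn(2, 16, 1024), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 100])
```

A late-fusion design of this kind keeps the two context streams separable, so the long-range action-history cue can be swapped or ablated independently of the visual encoder.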
Published work: V. Manousaki, K. Bacharidis, K. Papoutsakis and A. Argyros, "VLMAH: Visual-Linguistic Modeling of Action History for Effective Action Anticipation", in IEEE/CVF International Conference on Computer Vision Workshops (ACVR 2023 - ICCVW 2023), IEEE, pp. 1917-1927, Paris, France, October 2023. https://doi.org/10.1109/ICCVW60793.2023.00206
ArXiv work: V. Manousaki, et al., "Anticipating Object State Changes", arXiv preprint arXiv:2405.12789, 2024. https://arxiv.org/abs/2405.12789
Submission Number: 166