Keywords: Vision and Learning, object state change, human action, anticipation, procedural videos
TL;DR: The proposed methods tackle the prediction of human actions and object state changes in goal-oriented videos (e.g., assembly) by fusing visual, semantic, and linguistic cues. A new benchmark for object state change anticipation is also introduced.
Abstract: Understanding and anticipating how human activities unfold over time is a core challenge in video understanding, with broad applications in robotics, assistive technologies, and scene interpretation. Real-world activities, such as cooking, assembly tasks, or object manipulation, are inherently goal-oriented processes consisting of structured sequences of actions that lead to associated changes in objects’ states. In this context, two key challenges emerge: action anticipation, i.e., predicting the future procedural steps of an activity, and object state change anticipation, i.e., forecasting the expected transformations of objects as a result of near-future, yet unseen, human actions. Addressing these problems requires models capable of integrating the immediate visual context of an activity with its long-term semantic context, which relates to the history of actions and interactions. The proposed methods address these challenges. Specifically, we combine visual information with semantic and linguistic cues to effectively capture both short-term and long-range temporal dependencies between actions, enabling models to achieve improved action anticipation. Additionally, we introduce a new benchmark and methodology targeting the task of object state change anticipation by leveraging curated annotations and multimodal inputs to connect prior object interactions with probable future transformations. Based on these contributions, we aspire to advance anticipatory video understanding by enabling AI systems to model and recognize human intention, the procedural structure of human activities, and the induced functional object transformations.
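To make the idea of combining short-term visual context with a linguistic encoding of the action history more concrete, the following minimal PyTorch-style sketch shows one common way such a fusion could be wired for next-action anticipation. This is an illustrative assumption only, not the published VLMAH architecture: all module names, dimensions, and the concatenation-based fusion scheme are hypothetical.

```python
# Hypothetical sketch of visual-linguistic fusion for next-action anticipation.
# All names, dimensions, and design choices are illustrative assumptions,
# not the published VLMAH implementation.
import torch
import torch.nn as nn


class AnticipationFusion(nn.Module):
    def __init__(self, visual_dim=1024, history_dim=768, hidden_dim=512, num_actions=100):
        super().__init__()
        # Temporal encoder for recently observed visual features (short-term context).
        self.visual_rnn = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # Projection of a language embedding of the action history (long-term context).
        self.history_proj = nn.Linear(history_dim, hidden_dim)
        # Classifier over the next (not yet observed) action.
        self.classifier = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, visual_feats, history_embedding):
        # visual_feats: (batch, time, visual_dim) frame/clip features of the observed segment.
        # history_embedding: (batch, history_dim) embedding of the past action labels as text.
        _, h_n = self.visual_rnn(visual_feats)        # h_n: (1, batch, hidden_dim)
        visual_ctx = h_n[-1]                          # (batch, hidden_dim)
        history_ctx = torch.relu(self.history_proj(history_embedding))
        fused = torch.cat([visual_ctx, history_ctx], dim=-1)
        return self.classifier(fused)                 # logits over candidate next actions


# Toy usage with random tensors standing in for real video and text features.
model = AnticipationFusion()
logits = model(torch.randn(2, 16, 1024), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 100])
```

A late-fusion design of this kind keeps the two context streams separable, so the long-range action-history cue can be swapped or ablated independently of the visual encoder.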
Published work: V. Manousaki, K. Bacharidis, K. Papoutsakis and A. Argyros, "VLMAH: Visual-Linguistic Modeling of Action History for Effective Action Anticipation", in IEEE/CVF International Conference on Computer Vision Workshops (ACVR 2023 - ICCVW 2023), IEEE, pp. 1917-1927, Paris, France, October 2023. https://doi.org/10.1109/ICCVW60793.2023.00206
ArXiv work: V. Manousaki, et al., "Anticipating Object State Changes", arXiv preprint arXiv:2405.12789, 2024. https://arxiv.org/abs/2405.12789
Submission Number: 166