Actions-to-Action: Inductive Attention for Egocentric Video Action Anticipation

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: egocentric video, action anticipation, attention, recurrent
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Video action anticipation is a computer vision task that, unlike action recognition, requires predicting future actions from observed video history. This paper presents a model designed to overcome the limitations of existing solutions by combining recurrent and attention mechanisms, drawing inspiration from the principles of object tracking. Notably, our model leverages its prior anticipation results, enabling a nuanced interpretation of semantic transitions between actions while accounting for the uncertainty inherent in predicting future events. This strategy strikes a balance between computational efficiency and judicious data utilization, challenging assumptions prevalent in current transformer models and thereby underlining its practicality for real-world applications. Distinctively, our model captures temporal connections between abstract action concepts in a way that mirrors human reasoning, and it adopts a recurrent structure to thoroughly capture video context. Extensive experiments on EPIC-Kitchens-100, EPIC-Kitchens-55, and EGTEA Gaze+ confirm the superior performance and efficiency of our proposed model compared to established transformer architectures. Remarkably, it surpasses most multi-modal models while using only RGB visual input, demonstrating strong generalization across a variety of unseen test sets.
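The abstract describes an architecture that couples attention over observed frame features with a recurrent state, feeding each anticipation result back as input to the next step. Below is a minimal PyTorch sketch of that idea; it is an illustration under assumed shapes and names (RecurrentAnticipationCell, the GRU-plus-attention layout, and all dimensions are hypothetical), not the paper's actual implementation.

```python
# Hypothetical sketch: a recurrent cell that attends over past frame features
# and conditions on its own previous action prediction, as the abstract describes.
import torch
import torch.nn as nn

class RecurrentAnticipationCell(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, num_actions=1000, num_heads=8):
        super().__init__()
        # num_actions is a placeholder; the real class count is dataset-dependent.
        self.action_embed = nn.Embedding(num_actions, hidden_dim)  # embed prior prediction
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_actions)

    def forward(self, frame_feats, prev_action, hidden):
        # frame_feats: (B, T, feat_dim) RGB features of the observed clip
        # prev_action: (B,) index of the previously anticipated action
        # hidden:      (B, hidden_dim) recurrent state carried across steps
        query = (hidden + self.action_embed(prev_action)).unsqueeze(1)  # (B, 1, H)
        keys = self.frame_proj(frame_feats)                             # (B, T, H)
        context, _ = self.attn(query, keys, keys)   # attend over observed frames
        hidden = self.gru(context.squeeze(1), hidden)  # update recurrent state
        return self.classifier(hidden), hidden         # next-action logits

# Usage: roll the cell over successive clips, feeding predictions back in.
cell = RecurrentAnticipationCell()
B, T = 2, 8
hidden = torch.zeros(B, 512)
prev_action = torch.zeros(B, dtype=torch.long)
for clip in [torch.randn(B, T, 1024) for _ in range(3)]:
    logits, hidden = cell(clip, prev_action, hidden)
    prev_action = logits.argmax(dim=-1)  # prior anticipation conditions the next step
```

The feedback loop in the usage example is the point: unlike a one-shot transformer that re-encodes the full history per prediction, the recurrent state and the previous prediction carry context forward, which is the efficiency argument the abstract makes.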
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6336