Com-STAL: Compositional Spatio-Temporal Action Localization

Shaomeng Wang, Rui Yan, Peng Huang, Guangzhao Dai, Yan Song, Xiangbo Shu

Published: 01 Jan 2023, Last Modified: 31 Mar 2026, Venue: IEEE Transactions on Circuits and Systems for Video Technology, License: CC BY-SA 4.0
Abstract: Spatio-temporal action localization aims to locate actors in space and time and to classify their actions. However, prior research overlooks the fact that, in real-world scenarios, human actions often involve novel objects. It thereby neglects the variety of action-object combinations, which considerably limits the generalization of the resulting models. In this paper, we study action-object combinations by exploiting their multi-modal visual information. To this end, we propose a novel compositional spatio-temporal action localization (Com-STAL) task, in which the training and test sets contain non-overlapping action-object combinations. Based on this task, we construct a compositional action localization dataset (Com-AD). Beyond that, we propose a simple yet effective framework, the Instance-Centric Interaction Network (ICIN), which reduces invalid inductive biases within the visual modality and alleviates the combined distribution bias issue by leveraging additional modal information. Extensive experimental results on Com-AD demonstrate the superior action localization performance of ICIN.