Keywords: Video action recognition, fine-grained action, information diffusion, spatiotemporal feature
Abstract: Video action recognition has attracted broad research interest and is widely used in public surveillance, video content review, sports events, and other fields. However, the high similarity of video backgrounds and the long temporal span of actions pose serious challenges for action recognition. In this work, we propose a spatio-temporal diffusion transformer (STD-Former) to improve recognition accuracy for long-range and fine-grained actions. STD-Former uses a two-branch network to extract spatiotemporal and temporal information from video, respectively. First, in the spatiotemporal branch, we construct a parallel transformer module that captures the spatiotemporal features of actions through a two-dimensional convolutional structure. Second, in the temporal branch, we present a cross transformer module that integrates features from the spatiotemporal branch to model long-range temporal dependencies among video actions. In addition, we design a novel plug-and-play spatiotemporal diffusion module that feeds the features extracted by the temporal branch back to the spatiotemporal branch, thereby enhancing the model's ability to capture actions. Finally, to learn fine-grained action information across adjacent video frames, we establish another plug-and-play significant motion excitation module that converts the spatial information of adjacent frames into motion features. Experimental results on the Something-Something V1 and V2 datasets demonstrate that STD-Former identifies fine-grained actions more accurately and is more robust than current state-of-the-art action recognition models.
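To make the significant motion excitation idea in the abstract more concrete, the following PyTorch sketch illustrates one plausible way to convert the spatial features of adjacent frames into a motion cue that gates the original features. It is an assumption-laden illustration, not the authors' implementation: the class name `MotionExcitationSketch`, the channel reduction ratio, the depthwise transform, and the sigmoid-gated residual are all hypothetical design choices.

```python
# Hypothetical sketch of a motion-excitation-style module (assumed design, not the
# authors' code): spatial features of adjacent frames are differenced to obtain a
# motion cue, which then gates the original spatiotemporal features.
import torch
import torch.nn as nn


class MotionExcitationSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        reduced = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1,
                                   groups=reduced, bias=False)  # depthwise spatial conv
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width)
        nt, c, h, w = x.shape
        b = nt // num_frames
        reduced = self.squeeze(x).view(b, num_frames, -1, h, w)
        # Motion cue: difference between the transformed next frame and the current frame.
        nxt = self.transform(reduced[:, 1:].reshape(-1, reduced.shape[2], h, w))
        nxt = nxt.view(b, num_frames - 1, -1, h, w)
        diff = nxt - reduced[:, :-1]
        # Pad the last time step with zeros so the temporal length is preserved.
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        diff = diff.reshape(nt, -1, h, w)
        # Collapse spatial dims and gate the input channels with the motion cue.
        gate = torch.sigmoid(self.expand(self.pool(diff)))
        return x + x * gate  # residual path keeps static appearance information


if __name__ == "__main__":
    frames = torch.randn(2 * 8, 64, 14, 14)   # 2 clips of 8 frames each
    module = MotionExcitationSketch(channels=64)
    out = module(frames, num_frames=8)
    print(out.shape)                          # torch.Size([16, 64, 14, 14])
```

Because the module only reweights channels and adds a residual, a sketch like this could in principle be inserted after any block of a backbone, which is consistent with the plug-and-play property claimed in the abstract.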
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10098