Keywords: mesh-based action recognition, motion capture, transformer
TL;DR: We propose the first mesh-based action recognition method, which achieves state-of-the-art performance compared with skeleton-based and point-cloud-based models.
Abstract: We study the problem of human action recognition using motion capture (MoCap) sequences. Existing methods for MoCap-based action recognition take skeletons as input, which requires an extra manual mapping step and discards body shape information. We therefore propose a novel method that directly models raw mesh sequences, allowing it to benefit from the body prior and surface motion. Specifically, we design a new hierarchical transformer with intra- and inter-frame attention to learn effective spatial-temporal representations. Moreover, we define two self-supervised learning tasks, namely masked vertex modeling and future frame prediction, to further capture global context of appearance and motion. Our model achieves state-of-the-art performance compared with skeleton-based and point-cloud-based models. We will release our code and models.
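To make the hierarchical intra-/inter-frame attention concrete, below is a minimal PyTorch sketch of one such block: attention is applied first over the vertices within each frame (spatial), then over the frames for each vertex (temporal). The class name, feature dimensions, and the use of `nn.TransformerEncoderLayer` are assumptions for illustration only and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn


class HierarchicalMeshBlock(nn.Module):
    """Intra-frame attention over vertices, then inter-frame attention over time."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Intra-frame attention: tokens are the mesh vertices of a single frame.
        self.intra = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Inter-frame attention: tokens are the same vertex tracked across frames.
        self.inter = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, vertices, dim)
        b, t, v, d = x.shape
        x = self.intra(x.reshape(b * t, v, d)).reshape(b, t, v, d)   # spatial attention
        x = x.permute(0, 2, 1, 3).reshape(b * v, t, d)               # regroup by vertex
        x = self.inter(x).reshape(b, v, t, d).permute(0, 2, 1, 3)    # temporal attention
        return x


if __name__ == "__main__":
    # Toy example: 2 sequences, 8 frames, 64 sampled vertices, 128-dim vertex features.
    feats = torch.randn(2, 8, 64, 128)
    out = HierarchicalMeshBlock()(feats)
    print(out.shape)  # torch.Size([2, 8, 64, 128])
```

The self-supervised objectives described in the abstract (masked vertex modeling and future frame prediction) would operate on top of such representations, e.g. by masking a subset of vertex tokens or predicting features of held-out future frames; the exact formulation is specified in the paper rather than in this sketch.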
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Supplementary Material: zip
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/stmt-a-spatial-temporal-mesh-transformer-for/code)