Abstract: Action recognition is an important problem that requires identifying actions in video by learning complex interactions among scene actors and objects. However, modern deep-learning networks often require significant computation and may capture scene context using additional modalities that further increase compute costs. Efficient methods, such as those used for AR/VR, often rely only on human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object-based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover the scene context lost when only keypoint information is used, on the AVA action and Kinetics datasets.
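To make the idea of a keypoint-only structured intermediate representation concrete, the following is a minimal sketch, not the authors' implementation: it assumes human keypoints and object keypoints are embedded as separate tokens and that their higher-order interactions are modeled with a small self-attention encoder. All layer sizes, keypoint counts, and names (e.g., KeypointInteractionModel) are illustrative assumptions.

```python
# Minimal sketch (assumed design, not KeyNet's actual code): build a
# structured representation from human and object keypoints only, with
# no RGB input, and model actor-object interactions via self-attention.
import torch
import torch.nn as nn

class KeypointInteractionModel(nn.Module):
    """Embeds human and object keypoints as tokens and classifies
    per-person actions from their interactions."""

    def __init__(self, num_human_kpts=17, num_object_kpts=4, dim=128, num_classes=80):
        super().__init__()
        # Each keypoint is (x, y, confidence); humans and objects get
        # separate embeddings so token types remain distinguishable.
        self.human_embed = nn.Linear(num_human_kpts * 3, dim)
        self.object_embed = nn.Linear(num_object_kpts * 3, dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.interaction = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, human_kpts, object_kpts):
        # human_kpts:  (B, num_humans,  num_human_kpts,  3)
        # object_kpts: (B, num_objects, num_object_kpts, 3)
        h = self.human_embed(human_kpts.flatten(2))    # (B, num_humans, dim)
        o = self.object_embed(object_kpts.flatten(2))  # (B, num_objects, dim)
        tokens = torch.cat([h, o], dim=1)              # one token per actor/object
        tokens = self.interaction(tokens)              # model higher-order interactions
        # Classify only the human tokens; object tokens provide context.
        return self.classifier(tokens[:, : h.shape[1]])

# Example: 2 people and 3 context objects in a clip.
model = KeypointInteractionModel()
humans = torch.randn(1, 2, 17, 3)
objects = torch.randn(1, 3, 4, 3)
logits = model(humans, objects)  # (1, 2, 80) per-person action scores
```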