Abstract: We introduce a multi-stage framework for hand action recognition in videos from an egocentric RGB camera. The framework uses local geometry changes on the hand surface and focuses on learning the interaction between a primary hand and an assistive hand or object. Our method does not require 3D information about objects, such as the 6D object pose, which is difficult to annotate, or image depth, which requires an additional depth sensor, to learn an object's behavior while it interacts with hands. Instead, the proposed method learns the changes within the surface of the hand, the hand type, which is positively correlated with the hand action, and the locations of objects and hands in 2D image space. The framework synthesizes the mean curvature of the primary hand mesh model to encode the hand surface geometry. We also introduce a feature pooling layer to handle diverse scenarios: one hand, two hands, one hand with one object, and two hands with two objects. Our method outperforms state-of-the-art hand action recognition methods that use 6D object poses or a depth sensor.
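To illustrate how a single pooling step could cover all four scenarios, the sketch below reduces a variable number of per-entity feature vectors to one fixed-length vector. The abstract does not specify the pooling operator, so the element-wise max used here is an assumption, and the function name `pool_entity_features`, the slot ordering, and the toy feature values are hypothetical.

```python
from typing import List, Optional, Sequence


def pool_entity_features(
    features: Sequence[Optional[Sequence[float]]], dim: int
) -> List[float]:
    """Element-wise max over the feature vectors of the entities that are present.

    Each slot holds the feature vector of one entity (hand or object),
    or None when that entity is absent from the frame.
    Assumption: max pooling; the paper's actual operator may differ.
    """
    present = [f for f in features if f is not None]
    if not present:
        # No hand or object detected: fall back to a zero vector.
        return [0.0] * dim
    for f in present:
        if len(f) != dim:
            raise ValueError("all feature vectors must have length dim")
    return [max(f[i] for f in present) for i in range(dim)]


# Hypothetical slot order: primary hand, assistive hand, object 1, object 2.
one_hand = pool_entity_features([[0.2, 0.9, 0.1], None, None, None], dim=3)
two_hands_two_objects = pool_entity_features(
    [[0.2, 0.9, 0.1], [0.5, 0.3, 0.4], [0.1, 0.1, 0.8], [0.7, 0.2, 0.0]],
    dim=3,
)
```

Because the output length depends only on `dim`, the downstream classifier sees the same input shape whether one entity or four are present.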