Abstract: Human-object interaction (HOI) detection is a core problem in human-centric scene understanding; its goal is to infer <human, verb, object> triplets that describe how humans interact with objects. Previous works mainly determine the interaction of each human-object pair by joint inference over multiple features. In this paper, we design a more discriminative representation of the human-object pair and a more effective HOI detection model. On the one hand, we use human poses as an attention mechanism to strengthen features, which is a novel way to handle human poses in HOI detection. On the other hand, to represent objects more effectively, each object is encoded as a word vector, and the relation features between humans and objects are captured by a graph convolutional network built on object word vectors and human appearance features. These relation features are also strengthened by a human pose attention mechanism. Our model yields favorable results compared with state-of-the-art HOI detection algorithms on two large-scale benchmark datasets, V-COCO and HICO-DET.
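The two mechanisms named in the abstract, pose-driven attention and a graph convolution over human and object nodes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax weighting over region features, the row-normalized adjacency, and the assumption that human appearance features and object word vectors have already been projected to a common dimension are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pose_attention(region_feats, pose_scores):
    """Pool region features with weights derived from pose scores.

    Assumption: one scalar pose score per region (e.g. per body part);
    the abstract does not specify how pose is turned into attention.
    """
    weights = softmax(pose_scores)
    dim = len(region_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, region_feats))
            for d in range(dim)]


def graph_conv(node_feats, adjacency, weight):
    """One graph-convolution step: ReLU(D^-1 A X W).

    Assumption: simple row normalization; every node needs a nonzero
    degree (e.g. via self-loops in `adjacency`).
    """
    in_dim = len(node_feats[0])
    out_dim = len(weight[0])
    out = []
    for row in adjacency:
        deg = sum(row)
        # Average the neighbours' features.
        agg = [sum(a * node_feats[j][d] for j, a in enumerate(row)) / deg
               for d in range(in_dim)]
        # Linear projection followed by ReLU.
        proj = [sum(agg[d] * weight[d][k] for d in range(in_dim))
                for k in range(out_dim)]
        out.append([max(0.0, v) for v in proj])
    return out


# Hypothetical toy inputs: node 0 is a human appearance feature,
# node 1 is an object word vector, both already in a shared 2-D space.
human_feat = pose_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
object_vec = [0.0, 1.0]
adjacency = [[1, 1], [1, 1]]          # fully connected, with self-loops
identity = [[1.0, 0.0], [0.0, 1.0]]   # placeholder for a learned weight
relation_feats = graph_conv([human_feat, object_vec], adjacency, identity)
```

In this sketch the relation features are simply the output node features of the graph step; a real model would learn the projection weights and feed the result to a verb classifier.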
External IDs: dblp:journals/mta/DengZLDH22