HOIG: End-to-End Human-Object Interactions Grounding with Transformers

ICME 2022 (modified: 19 Nov 2022)
Abstract: Visual grounding is a crucial and challenging problem in many applications. While it has been extensively investigated in recent years, human-centric grounding with multiple instances remains an open problem. In this paper, we introduce a new task, Human-Object Interactions (HOI) Grounding, which localizes all the referred human-object pair instances in an image given a ⟨human, interaction, object⟩ phrase. We design a transformer-based encoder-decoder architecture that models the task as a set prediction problem. A vision-language alignment module and a grounding decoder are designed to learn accurate cross-modal contexts and interactions. Our model performs alignment and prediction end to end, without pre-trained detectors or post-processing. Experiments on two challenging datasets demonstrate the effectiveness of our model.
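In DETR-style set prediction, each predicted instance is assigned to a ground-truth instance by minimum-cost bipartite matching before the loss is computed. The paper does not give implementation details, so the sketch below is illustrative only: a brute-force matcher over permutations (practical systems use the Hungarian algorithm) with a hypothetical cost matrix standing in for the combined box/phrase-alignment costs.

```python
from itertools import permutations

def match_predictions(cost):
    """Return the assignment of predictions to ground-truth HOI pairs
    that minimizes total matching cost, by brute force over permutations.
    cost[i][j] is the (hypothetical) cost of matching prediction i
    to ground-truth pair j."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_perm, best_cost

# Hypothetical 3x3 cost matrix: 3 predicted HOI instances vs 3 ground truths
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
perm, total = match_predictions(cost)
print(perm, round(total, 1))  # (0, 1, 2) 0.6
```

In a real model, the number of queries exceeds the number of ground-truth pairs, so unmatched queries are assigned to a "no interaction" class; the matching step itself is the same idea.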