Abstract: Transformers have demonstrated potential in leveraging features for two-stage human-object interaction (HOI) detection, but a considerable performance gap to one-stage methods persists. We attribute this gap to the limited granularity of the first-stage features. In this paper, we introduce a multi-cue injected Transformer (MIT), specifically devised for two-stage HOI detection. MIT efficiently utilizes multi-granularity information, encompassing cues related to instances, bounding boxes, 3D poses, and global context. First, MIT uses bounding box cues to associate instances (humans and objects) that have potential interactive relationships, then fuses these instances with 3D pose cues to derive a fused embedding for each human-object pair. These embeddings are then refined by querying global context feature maps. Through the hierarchical integration of these diverse cues, MIT substantially enhances HOI detection performance. Extensive experiments validate MIT’s effectiveness and its superiority over state-of-the-art methods.
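The three stages described above (pairing by box cues, fusing with pose, refining against global context) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`pair_box_cue`, `mit_sketch`), the 10-dimensional box cue, the additive fusion, the random projection matrices, and the single-head attention are all assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (queries attend over keys)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def pair_box_cue(human_box, object_box):
    """Hypothetical box cue: both boxes (x1, y1, x2, y2) plus centre offset."""
    h_centre = np.array([human_box[0] + human_box[2],
                         human_box[1] + human_box[3]]) / 2.0
    o_centre = np.array([object_box[0] + object_box[2],
                         object_box[1] + object_box[3]]) / 2.0
    return np.concatenate([human_box, object_box, o_centre - h_centre])


def mit_sketch(human_feats, object_feats, human_boxes, object_boxes,
               poses, context, rng):
    """Return one refined embedding per (human, object) pair.

    human_feats:  (H, d) instance features, object_feats: (O, d)
    human_boxes:  (H, 4), object_boxes: (O, 4)
    poses:        (H, p) flattened 3D pose per human
    context:      (T, d) flattened global context feature map
    """
    d = human_feats.shape[1]
    # Assumed linear projections (random here; they would be learned).
    w_box = rng.standard_normal((10, d)) * 0.1
    w_pose = rng.standard_normal((poses.shape[1], d)) * 0.1

    pairs = []
    for i in range(len(human_feats)):
        for j in range(len(object_feats)):
            # Stage 1: associate a human and an object through box cues.
            box = pair_box_cue(human_boxes[i], object_boxes[j]) @ w_box
            # Stage 2: inject the human's 3D pose into the pair embedding.
            pose = poses[i] @ w_pose
            pairs.append(human_feats[i] + object_feats[j] + box + pose)
    pairs = np.stack(pairs)

    # Stage 3: refine pair embeddings by querying the global context,
    # with a residual connection.
    return pairs + cross_attention(pairs, context, context)
```

With 2 humans, 3 objects, and feature width 8, the sketch yields a `(6, 8)` array, one row per candidate pair. A real system would replace the exhaustive pairing and additive fusion with learned Transformer layers.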