Efficient Guided Query Network for Human-Object Interaction Detection

Published: 2024, Last Modified: 14 May 2025, ICME 2024, CC BY-SA 4.0
Abstract: Recently, Transformer-based one-stage methods have demonstrated excellent efficiency in Human-Object Interaction (HOI) detection. However, these methods often rely on semantically ambiguous initial queries, which constrains the model's set-prediction ability. In addition, widely used HOI datasets have long-tailed distributions, so accurately identifying rare interaction categories remains challenging. To address these challenges, we propose an Efficient Guided Query Network (EGQ-Net). The network introduces a forward-guided relational query approach, which accurately captures interaction triplets by integrating the initial queries predicted by the encoder with the output features of each decoder layer. Furthermore, we use the vision-language pre-trained models CLIP and BLIP2 to design interaction position query guidance and interaction content query guidance, so that the queries accurately recognize and localize interactive regions. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on two widely used HOI benchmarks, V-COCO and HICO-DET.