Dual-Branch Knowledge Enhancement Network with Vision-Language Model for Human-Object Interaction Detection

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, pre-trained Vision-Language Models (VLMs) have shown strong recognition ability in the HOI detection task. However, these VLM-based methods struggle to transfer knowledge effectively enough to achieve the desired performance. To this end, we propose a Dual-Branch Knowledge Enhancement Network with VLM (DBKEN-VLM) within the two-stage paradigm to enhance the effectiveness of the VLM. Specifically, we propose a semantic mining decoder that supplements contextual and action-related semantic information in our model; together with a spatial-guided decoder, it forms a dual-branch knowledge enhancement network. Furthermore, we propose a two-level fusion strategy for the dual-branch network to facilitate better knowledge transfer from the VLM: feature-level fusion produces more instructive interaction features, while decision-level fusion further enhances the capability of the VLM for HOI detection. The proposed method achieves competitive performance compared to recent methods on two benchmark datasets, HICO-DET and V-COCO.
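To make the dual-branch and two-level fusion ideas concrete, below is a minimal PyTorch sketch of how such a head could be organized. It is an illustration only: the module choices (plain transformer decoder layers for the two branches), the feature dimensions, the concatenation-plus-projection used for feature-level fusion, and the weighted sum used for decision-level fusion are all assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumed design, not the paper's code): two decoder
# branches over pair queries, fused first at the feature level and then at
# the decision level with logits coming from a VLM.
import torch
import torch.nn as nn


class DualBranchFusionHead(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 600, alpha: float = 0.5):
        super().__init__()
        # Branch 1: spatial-guided decoder (stand-in: one transformer decoder layer).
        self.spatial_decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Branch 2: semantic mining decoder attending to action-related semantic memory.
        self.semantic_decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Feature-level fusion: merge the two branch outputs into one interaction feature.
        self.feature_fusion = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)
        # Decision-level fusion weight between the model's logits and the VLM's logits.
        self.alpha = alpha

    def forward(self, pair_queries, image_feats, semantic_feats, vlm_logits):
        # pair_queries:   (B, Q, dim)  human-object pair queries
        # image_feats:    (B, N, dim)  visual memory tokens
        # semantic_feats: (B, M, dim)  contextual / action-related semantic memory
        # vlm_logits:     (B, Q, num_classes)  interaction scores from the VLM
        spatial_out = self.spatial_decoder(pair_queries, image_feats)
        semantic_out = self.semantic_decoder(pair_queries, semantic_feats)

        # Feature-level fusion: build a more instructive interaction feature.
        fused = self.feature_fusion(torch.cat([spatial_out, semantic_out], dim=-1))
        logits = self.classifier(fused)

        # Decision-level fusion: blend model predictions with VLM predictions.
        return self.alpha * logits + (1.0 - self.alpha) * vlm_logits


if __name__ == "__main__":
    head = DualBranchFusionHead()
    B, Q, N, M, dim, C = 2, 16, 100, 32, 256, 600
    out = head(torch.randn(B, Q, dim), torch.randn(B, N, dim),
               torch.randn(B, M, dim), torch.randn(B, Q, C))
    print(out.shape)  # torch.Size([2, 16, 600])
```

In this sketch the blend weight `alpha` is a fixed hyperparameter; a learned or per-class weighting would be an equally plausible choice for the decision-level fusion.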