Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions as <human, action, object> triplets. Recent advancements in pre-trained vision-language models (VLMs) have improved zero-shot HOI detection, enabling the identification of unseen triplets. However, existing methods leverage the VLM only as an additional encoder for interaction prediction, not for human/object detection. This limitation hinders their ability to detect unseen objects. Furthermore, the additional encoder increases both model size and computational cost. This paper proposes ECI-HOI, a novel HOI detection framework that unleashes the potential of the pre-trained VLM for zero-shot HOI detection by leveraging it for both sub-tasks. We first employ CLIP as a single image encoder, reducing redundancy in the network architecture. In addition, we propose an instance selector and an HO pair decoder to effectively harmonize human/object detection and interaction prediction in a zero-shot manner. We evaluate our model under various settings on HICO-DET and on our two new test sets: an out-of-distribution image test set and a novel object test set. Our model outperforms state-of-the-art models while reducing model size by more than 50%, notably achieving a +10.01 mAP improvement under the unseen object setting on HICO-DET. The results on the proposed datasets highlight the zero-shot performance of our model in more challenging settings.