Towards Open-vocabulary HOI Detection with Calibrated Vision-language Models and Locality-aware Queries

Zhenhao Yang; Xin Liu; Deqiang Ouyang; Guiduo Duan; Dongyang Zhang; Tao He; Yuan-Fang Li

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0

Abstract: Open-vocabulary human-object interaction (Ov-HOI) detection aims to identify both base and novel categories of human-object interactions when only base categories are available during training. Existing Ov-HOI methods commonly leverage knowledge distilled from CLIP to extend their ability to detect previously unseen interaction categories. However, our empirical observations indicate that the inherent noise present in CLIP has a detrimental effect on HOI prediction. Moreover, the absence of novel human-object position distributions often leads the learned queries to overfit to the base categories. To address these issues, we propose a two-step framework named CaM-LQ, Calibrating visual-language Models (e.g., CLIP) for open-vocabulary HOI detection with Locality-aware Queries. By injecting fine-grained HOI supervision from the calibrated CLIP into the HOI decoder, our model can predict novel interactions. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods across multiple metrics on mainstream datasets, e.g., a 4.54-point improvement on the HICO-DET dataset over the SoTA CLIP4HOI on the UV task with the same ResNet-50 backbone.

Primary Subject Area: [Content] Vision and Language

Secondary Subject Area: [Content] Media Interpretation

Relevance To Conference: This work addresses challenges in using visual and linguistic (V&L) models for open-vocabulary human-object interaction (Ov-HOI) detection. First, it identifies inherent noise in V&L embeddings, which undermines HOI detection accuracy. Second, it introduces a novel approach to calibrate CLIP with HOI priors and fine-grained logit distillation, mitigating the impact of that noise. Third, it injects spatial priors into queries to decode pairwise HOI features, enhancing the model's focus on interaction points and nuanced relationships.

Supplementary Material: zip

Submission Number: 1481
