Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

Yu-Wei Zhan, Fan Liu, Xin Luo, Xin-Shun Xu, Liqiang Nie, Mohan Kankanhalli

Published: 27 Oct 2025; Last Modified: 19 Dec 2025; License: CC BY-SA 4.0
Abstract: Human-Object Interaction (HOI) detection involves localizing human-object pairs and predicting the interactions between them. The task is challenging because of the complexity of human behavior and the diverse contexts in which interactions occur. Contextual cues, such as the participants involved, body language, and the surrounding environment, are crucial for accurately identifying interactions, particularly those that are ambiguous or previously unseen. In this paper, we propose ConCue, a novel approach that integrates contextual cue generation with feature extraction to enhance HOI detection. Specifically, we design specialized prompts for Large Vision-Language Models (VLMs) that elicit rich contextual cues from images. These cues are then integrated into HOI detection through a multi-tower feature extraction module that incorporates contextual information into both the instance and interaction detection processes. Extensive experiments demonstrate the effectiveness of ConCue: integrating it with state-of-the-art HOI methods yields significant performance improvements on two widely used benchmark datasets, highlighting its potential for advancing HOI detection.