Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Published: 01 Jan 2024 · Last Modified: 08 Apr 2025 · Appl. Intell. 2024 · CC BY-SA 4.0
Abstract: Human-object interaction (HOI) detection is an important computer vision task that recognizes the interactions between humans and surrounding objects in an image or video. HOI datasets suffer from a severe long-tailed distribution, since it is impractical to collect a dataset that covers all potential interactions. Many HOI detectors address this issue by leveraging vision-language models. However, owing to the global self-attention mechanism of the Transformer, vision-language models are weak at extracting the local features of input samples. We therefore propose a novel local feature enhanced Transformer that encourages the encoders to extract more informative multi-modal features. Moreover, the application of prompt learning to HOI detection is still in its early stages. Consequently, we propose a multi-modal adaptive prompt module, which uses an adaptive learning strategy to facilitate the interaction between language and visual prompts. On the HICO-DET and SWIG-HOI datasets, the proposed model achieves 24.21% and 14.29% mAP on the full set of interactions, respectively. Our code is available at https://github.com/small-code-cat/AMP-HOI.
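To make the prompt-interaction idea concrete, below is a minimal sketch of what a multi-modal adaptive prompt module could look like, assuming a CLIP-style dual encoder. All names and values here (prompt length, embedding dimensions, the tanh-gated cross-modal projections) are illustrative assumptions for exposition, not the authors' actual implementation; see the linked repository for the real code.

```python
# A hypothetical multi-modal adaptive prompt module: learnable prompt
# tokens for each modality exchange information through lightweight
# projections, with learned gates controlling how much to mix in.
import torch
import torch.nn as nn

class AdaptivePromptModule(nn.Module):
    def __init__(self, prompt_len=8, text_dim=512, vis_dim=768):
        super().__init__()
        # Independent learnable prompt tokens for each modality.
        self.text_prompt = nn.Parameter(torch.randn(prompt_len, text_dim) * 0.02)
        self.vis_prompt = nn.Parameter(torch.randn(prompt_len, vis_dim) * 0.02)
        # Cross-modal projections let each prompt condition on the other.
        self.t2v = nn.Linear(text_dim, vis_dim)
        self.v2t = nn.Linear(vis_dim, text_dim)
        # Adaptive gates, initialized to zero so training starts from
        # independent prompts and learns how much interaction to allow.
        self.gate_t = nn.Parameter(torch.zeros(1))
        self.gate_v = nn.Parameter(torch.zeros(1))

    def forward(self):
        # Each modality's prompt is its own tokens plus a gated projection
        # of the other modality's tokens (the "interaction" step).
        text_p = self.text_prompt + torch.tanh(self.gate_t) * self.v2t(self.vis_prompt)
        vis_p = self.vis_prompt + torch.tanh(self.gate_v) * self.t2v(self.text_prompt)
        return text_p, vis_p

# Usage: prepend the returned prompts to the token sequences of the text
# and image encoders before their Transformer layers.
prompts = AdaptivePromptModule()
text_p, vis_p = prompts()
print(text_p.shape, vis_p.shape)  # torch.Size([8, 512]) torch.Size([8, 768])
```

Initializing the gates at zero is a common design choice for adapter-style modules: the prompts behave as standard per-modality learnable tokens at the start of training, and cross-modal interaction is introduced gradually as the gates move away from zero.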