Abstract: A zero-shot human-object interaction (HOI) detector is capable of generalizing to HOI categories that were not encountered during training. Inspired by the impressive zero-shot capabilities of CLIP, recent methods strive to leverage CLIP embeddings to improve zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, which inevitably causes seen-unseen confusion at test time. Moreover, we find that prompt tuning and adapters further widen the gap between seen and unseen accuracy. To tackle this challenge, we present the first generation-based model using CLIP for zero-shot HOI detection, coined HOIGen. It unlocks the potential of CLIP for feature generation rather than feature extraction alone. To this end, we develop a CLIP-injected feature generator for producing human, object, and union features. We then extract real features of seen samples and mix them with the synthetic features, allowing the model to be trained on seen and unseen classes jointly. To enrich the HOI scores, we construct a generative prototype bank in a pairwise HOI recognition branch and a multi-knowledge prototype bank in an image-wise HOI recognition branch. Extensive experiments on the HICO-DET benchmark demonstrate that HOIGen achieves superior performance for both seen and unseen classes under various zero-shot settings, compared with other top-performing methods.
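The abstract's two core mechanisms lend themselves to a compact illustration. The sketch below (Python/PyTorch) shows, under assumed shapes and names, (1) mixing real CLIP features of seen classes with generator-synthesized features of unseen classes so both populations enter training jointly, and (2) scoring features against a prototype bank by cosine similarity. The generator here is a hypothetical Gaussian-perturbation stand-in for the paper's CLIP-injected feature generator; `synthesize_features`, `PrototypeBank`, and all dimensions are illustrative, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of joint seen/unseen training
# via feature mixing, plus prototype-bank scoring. All names and sizes
# are assumptions made for illustration.
import torch
import torch.nn.functional as F

FEAT_DIM = 512  # CLIP ViT-B/16 embedding size (assumed)

def synthesize_features(text_embeds: torch.Tensor, n_per_class: int) -> torch.Tensor:
    """Hypothetical stand-in for the CLIP-injected feature generator:
    perturb class text embeddings with Gaussian noise to obtain pseudo
    visual features for unseen HOI classes."""
    noise = 0.1 * torch.randn(text_embeds.size(0), n_per_class, FEAT_DIM)
    return (text_embeds.unsqueeze(1) + noise).reshape(-1, FEAT_DIM)

class PrototypeBank:
    """Stores one L2-normalized prototype per HOI class; scores a batch
    of query features by cosine similarity to every prototype."""
    def __init__(self, prototypes: torch.Tensor):
        self.prototypes = F.normalize(prototypes, dim=-1)  # (num_classes, FEAT_DIM)

    def score(self, feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(feats, dim=-1) @ self.prototypes.T  # (batch, num_classes)

# Joint training batch: real seen features mixed with synthetic unseen ones.
real_seen = torch.randn(32, FEAT_DIM)            # placeholder for CLIP-extracted features
unseen_text = torch.randn(10, FEAT_DIM)          # placeholder CLIP text embeddings of unseen HOIs
fake_unseen = synthesize_features(unseen_text, n_per_class=4)
batch = torch.cat([real_seen, fake_unseen], dim=0)   # classifier sees both populations

bank = PrototypeBank(torch.randn(600, FEAT_DIM))     # 600 HOI classes in HICO-DET
logits = bank.score(batch)                           # (72, 600) HOI scores
```

During training, a standard classification loss over these logits would cover seen and unseen labels alike, which is the point of mixing the two feature populations rather than training on seen classes only.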
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Zero-shot HOI detection enables recognizing human-object interactions (HOIs) in images without prior training on specific interaction classes. This capability expands the applicability of computer vision systems to diverse scenarios where new or unseen interactions may occur. By leveraging pre-trained models and semantic embeddings, zero-shot HOI methods bridge the semantic gap between visual content and natural-language descriptions, facilitating more robust and adaptable interaction detection. They also promote cross-modal understanding by integrating visual features with textual information, enriching context and improving the accuracy of HOI recognition. Furthermore, zero-shot approaches foster innovation in multimodal fusion, encouraging novel architectures that effectively combine visual and textual cues for HOI detection. Overall, zero-shot HOI detection serves as a catalyst for advances in multimedia processing, paving the way for more versatile and intelligent computer vision systems capable of understanding complex human-object interactions across domains.
Supplementary Material: zip
Submission Number: 939