Zero-shot Human-Object Interaction Recognition by Bridging Generative and Contrastive Image-Language Models

22 Sept 2022 (modified: 13 Feb 2023)
ICLR 2023 Conference Withdrawn Submission
Readers: Everyone
Keywords: Zero-shot, knowledge distillation, Human-Object Interaction
TL;DR: Our zero-shot HOI classifier outperforms supervised state-of-the-art methods by using a heterogeneous teacher-student framework that bridges generative and contrastive pre-trained image-language models through pseudo-label distillation.
Abstract: Existing studies in Human-Object Interaction (HOI) recognition rely heavily on costly human-annotated labels, limiting the application of HOI in real-world scenarios like retail and surveillance. To address this issue, this paper investigates a new zero-shot setup where no HOI labels are available for any image. We propose a novel heterogeneous teacher-student framework that bridges two types of pre-trained models, namely contrastive (e.g., CLIP) and generative (e.g., GIT) image-language models. To bridge the gap between them, we introduce pseudo-label distillation, which extracts HOI probabilities from image captions and uses them to train the student classifier. Our method leverages the complementary strengths of both models. As a result, the student model has "the best of both worlds": the compact backbone of a contrastive model and the fine-grained discriminability of a generative (captioning) model. It achieves 49.6 mAP on the HICO dataset without any ground-truth labels, setting a new state of the art that outperforms previous supervised approaches. Code will be released upon acceptance.
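The pseudo-label distillation idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how HOI probabilities are extracted from captions, so the keyword matching below is a stand-in assumption, not the authors' method. The class list, the hand-written captions (standing in for captioner output), and the names `HOI_CLASSES`, `captions_to_pseudo_labels`, and `soft_bce_loss` are all hypothetical.

```python
import math

# Hypothetical HOI classes as (verb, object) pairs; not taken from the paper.
HOI_CLASSES = [("ride", "bicycle"), ("hold", "cup"), ("ride", "horse")]


def _stem(verb):
    # Crude stemming so that "ride" also matches "rides" and "riding".
    return verb[:-1] if verb.endswith("e") else verb


def captions_to_pseudo_labels(captions, classes=HOI_CLASSES):
    """Soft pseudo-label per class: the fraction of captions that mention
    both the verb and the object (assumed stand-in for the teacher)."""
    if not captions:
        raise ValueError("need at least one caption")
    labels = []
    for verb, obj in classes:
        stem = _stem(verb)
        hits = 0
        for caption in captions:
            words = caption.lower().split()
            has_verb = any(w.startswith(stem) for w in words)
            has_obj = any(w.startswith(obj) for w in words)
            if has_verb and has_obj:
                hits += 1
        labels.append(hits / len(captions))
    return labels


def soft_bce_loss(student_probs, pseudo_labels, eps=1e-7):
    """Mean binary cross-entropy between student probabilities and
    soft pseudo-labels; this is the quantity a student would minimize."""
    total = 0.0
    for p, t in zip(student_probs, pseudo_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pseudo_labels)


# Hand-written captions standing in for several sampled captions of one image.
captions = [
    "a man riding a bicycle",
    "a person rides a bicycle down the street",
    "a man holding a cup",
]
pseudo_labels = captions_to_pseudo_labels(captions)
uninformed_loss = soft_bce_loss([0.5, 0.5, 0.5], pseudo_labels)
matched_loss = soft_bce_loss(pseudo_labels, pseudo_labels)
```

In this sketch the student's probabilities are plain lists; in a setup like the one the abstract describes they would come from a classifier on top of a contrastive backbone such as CLIP. The loss is lowest when the student's outputs match the caption-derived soft labels, which is what makes the labels usable as a training signal without ground-truth annotations.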
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
