Composing Human Object Interaction with Decoupled Prototype for Zero-shot Learning

03 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: human-object interaction detection, zero-shot learning
Abstract: Zero-shot Human-Object Interaction (HOI) detection is a daunting problem, largely stemming from the combinatorial explosion of potential action-object pairs. Current studies predominantly address this issue by transferring knowledge from large-scale pre-trained models (e.g., CLIP), yet overlook a more straightforward idea: mimicking the powerful compositional generalization ability of human intelligence, which builds on past cases. Moreover, they simplify the combinatorial challenge by assuming that knowledge about unseen compositions is accessible, which is usually impractical in reality. In this work, we extend the prior Closed-World zero-shot setting to an Open-World scenario, where the search space for HOI compositions is entirely unrestricted. For this challenging task, we introduce ProtoHOI, a fresh prototype-based framework for zero-shot HOI detection, which: i) distills a set of prototypes from HOI proposal embeddings to model the inherent properties of objects and actions in the context of HOI; and ii) recalibrates the representation space learned by the HOI detector based on these derived prototypes in a decoupled manner, thereby facilitating the prediction of unseen HOI compositions. Extensive experiments on two standard benchmarks demonstrate the superiority of ProtoHOI over state-of-the-art methods across all zero-shot settings. The source code will be released.
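To make the two-step idea in the abstract concrete, the sketch below illustrates one possible reading of it: action and object prototypes are distilled (here, simply as class-wise means) from labeled HOI proposal embeddings of seen compositions, and an unseen action-object pair is then scored in a decoupled way by combining similarities to the two prototypes. All names, the toy data, and the mean/cosine scoring rule are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HOI proposal embeddings, each labeled with a seen (action, object) pair.
D = 16
actions = ["ride", "hold"]
objects = ["bicycle", "umbrella"]
embeds = rng.normal(size=(8, D))
labels = [("ride", "bicycle"), ("ride", "bicycle"),
          ("hold", "umbrella"), ("hold", "umbrella"),
          ("ride", "umbrella"), ("hold", "bicycle"),
          ("ride", "bicycle"), ("hold", "umbrella")]

def prototype(name, slot):
    # "Distill" a prototype as the mean of all proposal embeddings
    # sharing the given action (slot=0) or object (slot=1) label.
    mask = np.array([lab[slot] == name for lab in labels])
    return embeds[mask].mean(axis=0)

action_protos = {a: prototype(a, 0) for a in actions}
object_protos = {o: prototype(o, 1) for o in objects}

def score(x, action, obj):
    # Decoupled scoring: similarity to the action prototype and to the
    # object prototype are computed independently, then combined,
    # so unseen (action, object) compositions can still be scored.
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos(x, action_protos[action]) + cos(x, object_protos[obj])

# Rank every composition for a query embedding, including pairs that
# were never (or rarely) observed together at training time.
query = embeds[0]
ranking = sorted(((score(query, a, o), a, o)
                  for a in actions for o in objects), reverse=True)
```

Because scoring factorizes over actions and objects, the candidate space grows additively (|actions| + |objects| prototypes) rather than multiplicatively, which is what makes the Open-World composition space tractable in this toy setup.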
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1261