Interaction-centric Hypersphere Reasoning for Multi-person Video HOI Recognition

21 Sept 2023 (modified: 27 Feb 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Multi-person Video HOI recognition, Interaction-centric, Hypersphere reasoning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We tackle multi-person video HOI recognition with interaction-centric hypersphere reasoning framework.
Abstract: Human-object interaction (HOI) recognition in videos represents a fundamental yet intricate challenge in computer vision, requiring perception and reasoning across both spatial and temporal domains, especially in multi-person scenes. HOI encompasses humans, objects, and the interactions that bind them. These three facets are interconnected and exert mutual influence upon one another. However, contemporary video HOI recognition methods focus on disentangled representations, neglecting these inherent interdependencies. We assert that these facets are inherently interdependent and that interactions carry the essential semantic meaning in HOIs. In light of this, we propose an interaction-centric hypersphere reasoning model for multi-person video HOI recognition. Specifically, we design a context fuser to model the interdependencies among humans, objects, and interactions. To encapsulate the semantic essence of video HOIs, our model adopts an interaction-centric hypersphere framework. Furthermore, to equip the model with the capacity for temporal reasoning, we introduce an interaction state reasoner module. Consequently, our model unravels the intricacies of HOI recognition and is flexible for both multi-person and single-person scenarios. Empirical results on the multi-person video HOI dataset MPHOI-72 indicate that our method surpasses the state-of-the-art (SOTA) method by more than 15%. At the same time, on the single-person datasets Bimanual Actions (single-human two-hand HOI) and CAD-120 (single-human HOI), our method achieves results on par with or better than SOTA methods.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3816