Decoupled Human-Object Interaction Inference Addressing Architectural Order Dependency in the Query-based Model
Keywords: Human-Object Interaction, Query-based Model
TL;DR: This paper establishes an evaluation framework for human-object interaction inference and improves the query-based model by removing its architectural order dependency and cleaning the dataset.
Abstract: Human-object interaction (HOI) inference is a crucial component of end-to-end HOI detection, responsible for predicting the interactions between humans and objects in an image. While query-based detectors have achieved state-of-the-art performance in HOI detection, their interaction inference modules are typically tightly coupled with the detection pipeline, which hinders independent evaluation and optimization. Recent research suggests that decoupling this module can improve overall detection, yet its standalone effectiveness remains underexplored. To this end, we introduce a dedicated evaluation framework for isolated HOI inference modules and identify two key factors limiting current performance: architectural order dependency and dataset impurities. To address these issues, we propose a novel interaction inference model that removes self-attention from the decoder, and we introduce dataset refinement strategies, including verb clustering and redundant bounding-box unification. Extensive experiments on multiple benchmarks demonstrate that our approach surpasses existing inference modules by an average of 20%, confirming its effectiveness and robustness. Optimizing the decoupled interaction inference model also further improves the end-to-end model.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 18342
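The abstract does not specify the decoder's architecture, so the following is only a toy sketch of one plausible reading of "removing self-attention from the decoder", not the authors' model. It assumes each decoder query stands for a human-object pair (`pair_a`, `pair_b`) and that `memory` holds image features. All names and numbers are hypothetical. With self-attention, the output for a pair changes depending on which other pairs are in the query set. With cross-attention only, each pair is processed independently.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def decoder_layer(queries, memory, use_self_attention):
    """Toy decoder layer (assumption: no projections, norms, or feed-forward)."""
    if use_self_attention:
        # Residual self-attention: every query mixes in the other queries.
        queries = [
            [q_d + a_d for q_d, a_d in zip(q, attend(q, queries, queries))]
            for q in queries
        ]
    # Cross-attention: each query reads the image features on its own.
    return [attend(q, memory, memory) for q in queries]

def max_abs_diff(u, v):
    return max(abs(x - y) for x, y in zip(u, v))

# Hypothetical toy inputs.
memory = [[0.2, -0.1, 0.4], [0.9, 0.3, -0.5], [-0.3, 0.8, 0.1]]
pair_a = [1.0, 0.0, 0.5]
pair_b = [-0.7, 1.2, 0.3]

for use_self_attention in (True, False):
    alone = decoder_layer([pair_a], memory, use_self_attention)[0]
    together = decoder_layer([pair_a, pair_b], memory, use_self_attention)[0]
    print(use_self_attention, max_abs_diff(alone, together))
```

The printed difference is nonzero when self-attention is enabled and exactly zero when it is disabled. Whether this set dependence is precisely what the paper calls "architectural order dependency" cannot be confirmed from the abstract alone.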