A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Published: 20 Jul 2024 · Last Modified: 02 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract:

Human-object interaction (HOI) detection aims to localize human-object pairs in images and recognize their corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, owing to the long-tailed bias of real-world data, existing methods mostly struggle with rare human-object pairs and yield sub-optimal results. With the recent progress of generative models, a straightforward remedy is to construct a more balanced dataset from a group of supplementary generated samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset does not significantly boost performance. To alleviate this problem, we present a novel model-agnostic framework, Context-Enhanced Feature Alignment (CEFA), which effectively aligns the generated data with the original data at the feature level and bridges the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On one hand, considering the crucial role of human-object pair information in HOI tasks, the feature alignment module aligns human-object pairs by aggregating instance information. On the other hand, to mitigate the loss of important context information caused by conventional discriminator-style alignment, we employ a context-enhanced image reconstruction module that improves the model's ability to learn contextual cues. Extensive experiments show that our method serves as a plug-and-play module that improves the detection performance of HOI models on rare categories.
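
The abstract does not specify implementation details, but the following PyTorch sketch illustrates one plausible reading of the two components: discriminator-style alignment of aggregated human-object pair features via a gradient-reversal layer, plus an image-reconstruction head intended to preserve contextual cues. All module names, dimensions, and loss terms here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the CEFA idea, assuming a PyTorch setup.
# Hypothetical shapes: pair_feats (N, feat_dim), backbone_feats
# (B, feat_dim, H/32, W/32), images (B, 3, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, a standard trick for
    discriminator-style feature alignment."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing to the feature extractor.
        return -ctx.lamb * grad_output, None


class CEFASketch(nn.Module):
    def __init__(self, feat_dim=256, img_channels=3):
        super().__init__()
        # Domain discriminator over aggregated human-object pair features:
        # predicts whether a pair feature comes from real or generated data.
        self.domain_disc = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )
        # Context-enhancement head: reconstructs the image from backbone
        # features so alignment does not discard contextual information.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 64, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, kernel_size=8, stride=8),
        )

    def forward(self, pair_feats, backbone_feats, images,
                domain_labels, lamb=1.0):
        # domain_labels: (N,) float, 1.0 for generated, 0.0 for real.
        rev = GradReverse.apply(pair_feats, lamb)
        align_loss = F.binary_cross_entropy_with_logits(
            self.domain_disc(rev).squeeze(-1), domain_labels)
        recon = self.decoder(backbone_feats)
        recon_loss = F.mse_loss(
            recon, F.interpolate(images, size=recon.shape[-2:]))
        return align_loss, recon_loss
```

In such a setup, the alignment and reconstruction losses would typically be added to the base HOI detector's training objective with tunable weights, which is consistent with the claimed plug-and-play use on top of existing detectors.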

Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia, [Generation] Multimedia Foundation Models
Relevance To Conference: This work on human-object interaction (HOI) detection contributes to multimedia and multimodal processing by enhancing the understanding and interpretation of complex scenes in which humans interact with objects. By accurately identifying and classifying human-object interactions in images or videos, this task enables more sophisticated content analysis, retrieval, and recommendation systems that are sensitive to the context and semantics of the media. In multimedia processing, HOI recognition facilitates systems that can provide detailed descriptions of visual content, improve the accuracy of tagging and indexing for efficient search, and generate rich metadata. In multimodal processing, HOI plays a crucial role in integrating information across modalities, such as combining visual cues with textual descriptions or sensor data to build a comprehensive understanding of a scene. For instance, in interactive applications such as virtual reality or robotics, recognizing HOIs helps create more natural and intuitive interactions between humans and computer-generated environments or autonomous agents. Overall, advances in HOI detection and classification directly enhance the capability of multimedia and multimodal systems to process and interpret data in a way that is closer to human-level understanding.
Supplementary Material: zip
Submission Number: 1437