Phrase-Based Affordance Detection via Cyclic Bilateral Interaction

Published: 01 Jan 2023 · Last Modified: 14 Aug 2024 · IEEE Trans. Artif. Intell. 2023 · CC BY-SA 4.0
Abstract: Affordance detection, which refers to perceiving objects with potential action possibilities in images, is a challenging task because the relevant affordance depends on the person's purpose in real-world application scenarios. Existing works mainly extract inherent human–object dependencies from images or videos to accommodate affordance properties that change dynamically. In this article, we approach affordance perception from a vision-language perspective and consider the challenging phrase-based affordance detection task: given a set of phrases describing potential actions, all object regions in a scene with the corresponding affordance should be detected. To this end, we propose a cyclic bilateral consistency enhancement network (CBCE-Net) that aligns language and vision features in a progressive manner. Specifically, CBCE-Net consists of a mutual guided vision-language module that progressively updates the common features of vision and language, and a cyclic interaction module that facilitates the perception of possible interactions with objects in a cyclic manner. In addition, we extend the public purpose-driven affordance dataset (PAD) by annotating affordance categories with short phrases. Extensive comparative experiments demonstrate the superior performance of our method over nine typical methods from four relevant fields in terms of both objective metrics and visual quality.
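The abstract describes two mechanisms: mutual vision-language guidance that progressively updates both modalities, and cyclic interaction between them. The paper's actual architecture is not reproduced here; the following is only a toy NumPy sketch of the general idea of alternating (bilateral) cross-modal attention, where region features and phrase-token features repeatedly refine each other before scoring regions against the pooled phrase. All function names, shapes, and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, scale):
    # Scaled dot-product attention: each query aggregates context features.
    attn = softmax(queries @ context.T * scale, axis=-1)
    return attn @ context

def mutual_guidance(vision, language, cycles=3):
    """Alternately refine vision and language features with cross-attention.

    A loose analogue of cyclic bilateral interaction (illustrative only):
    each cycle, vision features attend to language, then language attends
    to the updated vision features, with residual connections.
    """
    scale = 1.0 / np.sqrt(vision.shape[-1])
    for _ in range(cycles):
        vision = vision + cross_attend(vision, language, scale)    # language-guided vision
        language = language + cross_attend(language, vision, scale)  # vision-guided language
    return vision, language

def affordance_scores(vision, language):
    # Score each region by similarity to the pooled (mean) phrase embedding
    # after mutual refinement; softmax yields a distribution over regions.
    v, l = mutual_guidance(vision, language)
    phrase = l.mean(axis=0)
    return softmax(v @ phrase)

# Usage with random stand-in features: 5 region features, 3 phrase tokens, dim 8.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
tokens = rng.normal(size=(3, 8))
scores = affordance_scores(regions, tokens)  # one score per region, sums to 1
```

In a real model the attention would use learned projection matrices and normalization layers; the residual updates above merely show how both modalities can be revised in turn rather than fusing them once.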