Abstract: In human-robot interaction, humans and robots should be able to engage in natural dialogue that accounts for each party's perspective on objects in a shared space. However, existing 3D dense captioning methods cannot generate descriptions conditioned on arbitrary viewpoints. To address this issue, this paper proposes a method that incorporates viewpoint information and distinguishes between the target object and the reference object relative to which the target's spatial relationship is described. This enables the method to adjust descriptions appropriately as the viewpoint changes. The effectiveness of the proposed method is validated through both quantitative and qualitative evaluations.
External IDs: dblp:conf/mva/IrisawaIYSHOO25