Towards Understanding Object-Directed Actions: A Generative Model for Grounding Syntactic Categories of Speech Through Visual Perception

Amir Aly, Tadahiro Taniguchi

2018 (modified: 03 Nov 2022)ICRA 2018Readers: Everyone

Abstract: Creating successful human-robot collaboration requires robots to have high-level cognitive functions that could allow them to understand human language and actions in space. To meet this target, an elusive challenge that we address in this paper is to understand object-directed actions through grounding language based on visual cues representing the dynamics of human actions on objects, object characteristics (color and geometry), and spatial relationships between objects in a tabletop scene. The proposed probabilistic framework investigates unsupervised Part-of-Speech (POS) tagging to determine syntactic categories of words so as to infer grammatical structure of language. The dynamics of object-directed actions are characterized through the locations of the human arm joints - modeled on a Hidden Markov Model (HMM) - while manipulating objects, in addition to those of objects represented in 3D point clouds. These corresponding point clouds to segmented objects encode geometric features and spatial semantics of referents and landmarks in the environment. The proposed Bayesian learning model is successfully evaluated through interaction experiments between a human user and Toyota HSR robot in space.

0 Replies