- Abstract: Human world knowledge is both structured and flexible. When people see an object, they represent it not as a pixel array but as a meaningful arrangement of semantic parts. Moreover, when people refer to an object, they provide descriptions that are not merely true but also relevant in the current context. Here, we combine these two observations in order to learn fine-grained correspondences between language and contextually relevant geometric properties of 3D objects. To do this, we employed an interactive communication task with human participants to construct a large dataset containing natural utterances referring to 3D objects from ShapeNet in a wide variety of contexts. Using this dataset, we developed neural listener and speaker models with strong capacity for generalization. By performing targeted lesions of visual and linguistic input, we discovered that the neural listener depends heavily on part-related words and associates these words correctly with the corresponding geometric properties of objects, suggesting that it has learned task-relevant structure linking the two input modalities. We further show that a neural speaker that is "listener-aware" (that is, one that plans its utterances according to how an imagined listener would interpret its words in context) produces more discriminative referring expressions than a "listener-unaware" speaker, as measured by human performance in identifying the correct object.
- Keywords: Referential Language, 3D Objects, Part-Awareness, Neural Speakers, Neural Listeners
- TL;DR: How to build neural speakers and listeners that learn fine-grained characteristics of 3D objects from referential language.
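
The "listener-aware" idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's models: `toy_listener` is a hypothetical word-overlap scorer standing in for the neural listener, the candidate utterances and their speaker log-probabilities are invented, and the scoring rule (speaker log-probability plus `alpha` times the log-probability that the listener picks the target) is one simple way to combine the two signals.

```python
import math

def toy_listener(utterance, objects):
    """Hypothetical listener: softmax over word overlap with each object's attributes."""
    words = set(utterance.split())
    scores = [len(words & attrs) for attrs in objects]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listener_aware_choice(candidates, objects, target, alpha=1.0):
    """Pick the candidate utterance that best balances fluency and discriminability.

    candidates: list of (utterance, speaker_logprob) pairs (illustrative values).
    alpha: weight on the imagined listener; alpha=0 gives a listener-unaware speaker.
    """
    def score(item):
        utterance, speaker_logprob = item
        p_target = toy_listener(utterance, objects)[target]
        return speaker_logprob + alpha * math.log(p_target)
    return max(candidates, key=score)[0]

# Three chairs described by attribute sets; index 0 is the target.
objects = [
    {"chair", "four", "legs", "thin", "arms"},
    {"chair", "four", "legs", "thick"},
    {"chair", "four", "legs", "no", "back"},
]
# The speaker alone slightly prefers the generic description.
candidates = [
    ("chair with four legs", -1.0),
    ("chair with thin arms", -1.5),
]

aware = listener_aware_choice(candidates, objects, target=0, alpha=1.0)
unaware = listener_aware_choice(candidates, objects, target=0, alpha=0.0)
```

In this toy context all three chairs have four legs, so the generic description does not help the imagined listener; the listener-aware choice shifts to the utterance that mentions the distinguishing part, while the listener-unaware choice keeps the description the speaker finds most likely.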