Learning to encode spatial relations from natural language

Tiago Ramalho; Tomas Kocisky; Frederic Besse; S. M. Ali Eslami; Gabor Melis; Fabio Viola; Phil Blunsom; Karl Moritz Hermann

27 Sept 2018 (modified: 05 May 2023) | ICLR 2019 Conference Blind Submission | Readers: Everyone

Abstract: Natural language processing has made significant inroads into learning the semantics of words through distributional approaches; however, representations learnt via these methods fail to capture certain kinds of information implicit in the real world. In particular, spatial relations are encoded in a way that is inconsistent with human spatial reasoning and lacks invariance to viewpoint changes. We present a system capable of capturing the semantics of spatial relations such as behind, left of, etc. from natural language. Our key contributions are a novel multi-modal objective based on generating images of scenes from their textual descriptions, and a new dataset on which to train it. We demonstrate that internal representations are robust to meaning-preserving transformations of descriptions (paraphrase invariance), while viewpoint invariance is an emergent property of the system.

Keywords: generative model, grounded language, scene understanding, natural language

TL;DR: We introduce a system capable of capturing the semantics of spatial relations by grounding representation learning in vision.

Data: [CLEVR](https://paperswithcode.com/dataset/clevr), [SHAPES](https://paperswithcode.com/dataset/shapes-1)
