Understanding Descriptions of Visual Scenes Using Graph Grammars

Daniel Bauer

2013 (modified: 16 Jul 2019), AAAI 2013

Abstract: Teaching computers to understand the meaning of natural language text has long been an important goal for Artificial Intelligence. The focus of my work is on the interpretation of descriptions of visual scenes such as ‘A man is sitting on a chair and using the computer’. One application of this research is the automatic generation of 3D scenes (Coyne and Sproat 2001), such as the one in Figure 1c. Text-to-scene generation systems provide a way for non-artists to create graphical content and have wide-ranging applications in communication, entertainment, and education. The formal meaning representations in today’s natural language processing systems are usually limited to basic predicate-argument structure and coarse word sense. Such representations do not support inference and are not sufficiently detailed to visualize a scene. In my thesis, I am developing techniques for semantic parsing into a new type of meaning representation encoded as directed graphs. Graphs conveniently capture coreference and the hierarchical nature of meaning. My meaning representations contain two or more levels of granularity. The graph directly derived from the input text (the high-level representation) describes functional aspects of the scene (who does what to whom; Figure 1a). It can be rewritten into a low-level graph that contains concepts and relations that are more basic (Figure 1b) and eventually into conceptual primitives. For visual scenes, these low-level graphs express the basic spatial relations between objects. This meaning representation scheme is based on two powerful ideas in natural language understanding: decomposing word meaning into conceptual primitives to support inference (Schank 1972) and describing word meaning not as isolated fragments but as part of a larger conceptual frame. In particular, I build on the frame semantic theory of Fillmore (1982) and its implementation in the FrameNet lexical resource (Fillmore, Johnson, and Petruck 2003).
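To make the two levels of granularity concrete, the following is a minimal sketch of a high-level graph for the example sentence and one rewrite into lower-level spatial relations. It is an illustration under stated assumptions, not the grammar formalism developed in the thesis: the `Edge` type, the node names, the edge labels (`agent`, `location`, `theme`, `posture`, `on-top-of`), and the `rewrite_sit` rule are all hypothetical, and a plain edge substitution stands in for a real graph grammar.

```python
from typing import NamedTuple, Set


class Edge(NamedTuple):
    """A labeled, directed edge. All labels used below are hypothetical."""
    source: str
    label: str
    target: str


# High-level graph for 'A man is sitting on a chair and using the computer'.
# Both events point at the same node "man1", which is how a graph
# captures coreference without duplicating the entity.
high_level: Set[Edge] = {
    Edge("sit1", "instance", "sit"),
    Edge("sit1", "agent", "man1"),
    Edge("sit1", "location", "chair1"),
    Edge("use1", "instance", "use"),
    Edge("use1", "agent", "man1"),
    Edge("use1", "theme", "computer1"),
}


def rewrite_sit(graph: Set[Edge]) -> Set[Edge]:
    """Replace each 'sit' event with assumed low-level spatial relations.

    This is a made-up rule for illustration: a sitting event with an
    agent and a location becomes a posture edge plus a spatial edge.
    """
    result = set(graph)
    sit_events = {
        e.source for e in graph if e.label == "instance" and e.target == "sit"
    }
    for event in sit_events:
        roles = {e.label: e.target for e in graph if e.source == event}
        if "agent" not in roles or "location" not in roles:
            continue  # the rule does not match; leave the event untouched
        # Remove the functional description of the event ...
        result -= {e for e in graph if e.source == event}
        # ... and add the more basic relations that a renderer could use.
        result.add(Edge(roles["agent"], "posture", "seated"))
        result.add(Edge(roles["agent"], "on-top-of", roles["location"]))
    return result


low_level = rewrite_sit(high_level)
```

In this sketch the `use1` event is left in its high-level form because no rule for it is defined; a fuller system would keep applying rules until only conceptual primitives remain.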
Fillmore’s frame semantics focuses on valence patterns of a lexical item as the link between syntactic realization and elements of the conceptual frame surrounding it. There are a number of systems using FrameNet as training data to automatically annotate frame semantic structure on text (e.g. Das et al. 2010). FrameNet representations, however, are shallow: frames do not contain any internal structure and frame elements are as-
