Abstract: Most current research on automatically captioning and describing scenes with spatial content focuses on images. We outline
how descriptive text for a synthesized 3D scene can be generated from a suitable intermediate representation employed
in the synthesis algorithm. As an example, we synthesize scenes of medieval village settings and generate their descriptions.
Our system employs graph grammars, Markov Chain Monte Carlo optimization, and a natural language generation pipeline.
Randomly placed objects are evaluated and optimized by a cost function capturing neighborhood relations, path layouts, and
collisions. In a pilot study, we assess the performance of our framework by comparing the generated descriptions with
descriptions provided by human subjects. Although the human descriptions were often short and low-effort, the highest-rated
ones clearly outperformed our generated ones. Nevertheless, study participants rated the human descriptions, on average, as
less accurate than the automated ones.