- Abstract: In this work, we propose a goal-driven collaborative task that contains language, vision, and action in a virtual environment as its core components. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate via two-way communication using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human agents. We define protocols and metrics to evaluate the effectiveness of learned agents on this testbed, highlighting the need for a novel "crosstalk" condition which pairs agents trained independently on disjoint subsets of the training data for evaluation. We present models for our task, including simple but effective baselines and neural network approaches trained using a combination of imitation learning and goal-driven training. All models are benchmarked using both fully automated evaluation and by playing the game with live human agents.
- Keywords: CoDraw, collaborative drawing, grounded language
- TL;DR: We introduce a dataset, models, and training + evaluation protocols for a collaborative drawing task that allows studying goal-driven and perceptually + actionably grounded language generation and understanding.