Keywords: Visual Dialog, Self-Training, Semi-Supervised Learning, Dialogue Generation, Vision-and-Language
TL;DR: We propose a semi-supervised learning approach for Visual Dialog that generates synthetic dialog data for unlabeled images and trains the dialog agent on that data.
Abstract: Visual dialog (VisDial) is the task of answering a series of questions grounded in an image, using the dialog history as context. Prior work has either trained dialog models solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for VisDial, called Generative Self-Training (GST), to enhance the pre-training. Specifically, GST generates synthetic dialog data for unlabeled images via multimodal conditional text generation and trains the dialog model on both the synthetic and the original VisDial data. We also propose perplexity-based data selection and multimodal consistency regularization for robust training on the synthetic data. Evaluation on the VisDial v1.0 dataset shows that GST improves the pre-training and achieves new state-of-the-art results.
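
As a rough illustration of the perplexity-based data selection mentioned in the abstract, the following minimal sketch keeps only synthetic samples whose perplexity falls below a cutoff. The abstract does not specify the selection rule, so the threshold comparison, the `max_ppl` parameter, and the `log_probs` field are assumptions made for this example, not the paper's implementation.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_by_perplexity(samples: List[Dict], max_ppl: float) -> List[Dict]:
    """Keep synthetic samples the generator was confident about.

    Assumption: each sample is a dict with a hypothetical "log_probs" field
    holding the natural-log probabilities of its generated tokens, and
    "confident" means perplexity strictly below max_ppl.
    """
    return [s for s in samples if perplexity(s["log_probs"]) < max_ppl]


# Toy data (made up for illustration): one confident and one uncertain answer.
synthetic = [
    {"answer": "yes, it is", "log_probs": [-0.1, -0.2, -0.1]},
    {"answer": "a red kite", "log_probs": [-2.0, -3.0, -2.5]},
]
kept = select_by_perplexity(synthetic, max_ppl=5.0)
```

With this toy data, only the first sample passes the filter; in practice the cutoff would need to be tuned on held-out data.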