Adversarial Learning for Visual Storytelling with Sense Group Partition

Lingbo Mo, Chunhong Zhang, Yang Ji, Zheng Hu

2018 (modified: 09 Nov 2021), ACCV (4) 2018

Abstract: Visual storytelling aims to generate a paragraph that describes the content of a photo stream. Despite substantial progress in vision and language research, techniques for sequential vision-to-language generation are still far from perfect. Because of the limitations of maximum likelihood estimation in training, most existing models encourage high resemblance to the texts in the training database, which makes the descriptions overly rigid and lacking in diverse expressions. We therefore cast the task as a reinforcement learning problem and propose an Adversarial All-in-one Learning (AAL) framework that learns a reward model, which simultaneously incorporates the information of all images in the photo stream and all texts in the paragraph, and optimizes a generative model with the estimated reward. Specifically, in light of the linguistic reading theory that takes the sense group as the unit, we propose to perform paragraph generation at the sense group level instead of the sentence level. Experiments on a widely used dataset show that our approach generates higher-quality descriptions than previous baselines.
