Coupled-dynamic learning for vision and language: Exploring Interaction between different tasks

Ning Xu, Hongshuo Tian, Yanhui Wang, Weizhi Nie, Dan Song, An-An Liu, Wu Liu

Pattern Recognit. 2021 (modified: 10 Apr 2022)

Highlights:
• We propose a novel coupled-dynamic framework that exploits complementary knowledge across tasks: the image captioning and image synthesis tasks are trained synchronously to effectively reduce the distance between their task-dependent dynamics.
• To embed adverse information into each individual network, we construct a dual-loss architecture that connects the tasks. In particular, we propose a novel message interaction unit to interactively align task-dependent dynamics. To improve optimization, we decompose the objective function into three consecutive steps, which allows the use of Adadelta gradient algorithms in general back-propagation problems.
• We perform comprehensive evaluations on three image benchmarks. Our framework achieves competitive performance against state-of-the-art methods. Furthermore, we explore various alignment formulas and the generalization properties of the coupled-dynamic interactive learning framework.

Abstract: The vision and language communities have attracted intensive research interest. In particular, the image captioning task aims to generate natural language descriptions from image content. Conversely, the image synthesis task aims to generate realistic images from natural language descriptions. Both can achieve promising results using Long Short-Term Memory (LSTM), which models the sequence dynamics at each time step as a hidden state. Nevertheless, research on dynamics is often limited to an individual task, and there has been no progress in exploring the mutual relationship between dynamics in different tasks. In this work, we present a novel coupled-dynamic formulation that iteratively reduces the distance between task-dependent dynamics during training. To embed adverse information into each individual network, we construct dual-loss architectures to interactively align the dynamics.
We evaluate the proposed framework on the Flickr8k, Flickr30k and MSCOCO datasets. Experimental results show that our approach boosts both tasks together and achieves competitive performance against state-of-the-art methods.
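
The idea of a dual loss that pulls task-dependent dynamics together can be illustrated with a minimal sketch. The abstract does not state the alignment formula, so the code below assumes a mean squared Euclidean distance between the per-step hidden states of the two LSTMs and a single hypothetical weight on that term. The names `alignment_distance`, `dual_losses`, and `weight` are illustrative and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def alignment_distance(caption_states: List[Vector],
                       synthesis_states: List[Vector]) -> float:
    """Mean squared Euclidean distance between two hidden-state sequences.

    Assumption: both sequences have the same number of time steps and the
    same hidden size. The paper may use a different alignment formula.
    """
    if not caption_states or len(caption_states) != len(synthesis_states):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for h_cap, h_syn in zip(caption_states, synthesis_states):
        if len(h_cap) != len(h_syn):
            raise ValueError("hidden states must have the same size")
        # Squared Euclidean distance between the two states at this step.
        total += sum((a - b) ** 2 for a, b in zip(h_cap, h_syn))
    return total / len(caption_states)


def dual_losses(caption_loss: float,
                synthesis_loss: float,
                caption_states: List[Vector],
                synthesis_states: List[Vector],
                weight: float = 0.5) -> Tuple[float, float]:
    """Add the shared alignment term to each task's own loss.

    Each network keeps its task loss and is additionally penalized by the
    distance between the two sets of dynamics (hypothetical weighting).
    """
    distance = alignment_distance(caption_states, synthesis_states)
    return (caption_loss + weight * distance,
            synthesis_loss + weight * distance)


# Toy example: two time steps, hidden size 2.
caption_h = [[0.0, 1.0], [1.0, 1.0]]
synthesis_h = [[0.0, 0.0], [1.0, 3.0]]
print(alignment_distance(caption_h, synthesis_h))          # 2.5
print(dual_losses(1.0, 2.0, caption_h, synthesis_h, 0.5))  # (2.25, 3.25)
```

In an actual training loop the two losses would be minimized with a gradient method such as Adadelta, so that each network's dynamics move toward the other's. The three-step decomposition of the objective mentioned in the highlights is not reproduced here.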
