Grounding Language Representation with Visual Object Information via Cross Modal Pretraining

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: Grounded Language Learning, Language Model
Abstract: Previous studies of visually grounded language learning use a convolutional neural network (CNN) to extract features from the whole image and ground them in the sentence description. However, this approach has two main drawbacks: (i) the whole image usually contains more objects and background than the sentence describes, so matching the two confuses the grounded model; (ii) a CNN extracts only image-level features, not the relationships between the objects within the image, which limits the grounded model's ability to learn complex contexts. To overcome these shortcomings, we propose a novel object-level grounded language learning framework that enriches language representations with visual object-grounded information. The framework comprises three main components: (i) ObjectGroundedBERT captures visual-object relations and their textual descriptions through cross-modal pretraining via a Text-grounding mechanism, (ii) a Visual encoder represents the visual relations between objects, and (iii) a Cross-modal Transformer helps the Visual encoder and ObjectGroundedBERT learn the alignment and representation of image-text contexts. Experimental results show that our proposed framework consistently outperforms baseline language models on a range of language tasks from the GLUE benchmark and the SQuAD dataset.
One-sentence Summary: We propose a novel object-level grounded language learning framework that enriches language representations with visual object-grounded information.
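The abstract names three components (ObjectGroundedBERT, a Visual encoder, and a Cross-modal Transformer) but the submission page provides no code. The sketch below is a minimal, hypothetical PyTorch illustration of how such a composition could be wired together; all class names, layer sizes, the stand-in text encoder, and the cross-attention grounding step are assumptions made for exposition, not the authors' implementation.

```python
# Illustrative sketch only: every class name, dimension, and the stand-in
# text encoder below are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Stand-in for a BERT-style text encoder (ObjectGroundedBERT builds on BERT)."""

    def __init__(self, vocab_size=30522, hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, input_ids):                        # (B, T)
        return self.encoder(self.embed(input_ids))       # (B, T, H)


class VisualEncoder(nn.Module):
    """Encodes per-object region features and models relations between objects."""

    def __init__(self, obj_feat_dim=2048, hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(obj_feat_dim, hidden_dim)  # project detector features
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.relation_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, obj_feats):                        # (B, N_objects, obj_feat_dim)
        return self.relation_encoder(self.proj(obj_feats))   # (B, N_objects, H)


class CrossModalTransformer(nn.Module):
    """Aligns text token states with object states via cross-attention."""

    def __init__(self, hidden_dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states, visual_states):
        attended, _ = self.cross_attn(text_states, visual_states, visual_states)
        return self.norm(text_states + attended)         # residual, object-grounded text states


class ObjectGroundedSketch(nn.Module):
    """Toy composition of the three components named in the abstract."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.text_encoder = TextEncoder(hidden_dim=hidden_dim)
        self.visual_encoder = VisualEncoder(hidden_dim=hidden_dim)
        self.cross_modal = CrossModalTransformer(hidden_dim=hidden_dim)

    def forward(self, input_ids, obj_feats):
        text_states = self.text_encoder(input_ids)
        visual_states = self.visual_encoder(obj_feats)
        return self.cross_modal(text_states, visual_states)


# Example: 2 sentences of 16 tokens grounded against 36 detected object regions.
model = ObjectGroundedSketch()
grounded = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(grounded.shape)  # torch.Size([2, 16, 768])
```

The key design choice illustrated here is that grounding happens at the object level: the visual branch consumes per-object detector features and models object-object relations before the cross-modal step, rather than grounding the sentence against a single whole-image CNN feature.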