Textual Concept Expansion with Commonsense Knowledge to Improve Dual-Stream Image-Text Matching

Mingliang Liang, Zhuoran Liu, Martha A. Larson

2023 (modified: 17 Apr 2023), MMM (1) 2023

Abstract: We propose a Textual Concept Expansion (TCE) approach for creating joint textual-visual embeddings. TCE uses a multi-label classifier that takes a caption as input and outputs a set of concepts, which are used to expand, i.e., enrich, the caption. TCE addresses the challenge that an image and its caption share only a limited number of concepts by leveraging general knowledge about the world, i.e., commonsense knowledge. Following a recent trend, the commonsense knowledge is acquired by creative use of the training data. We test TCE within a popular dual-stream approach, Consensus-aware Visual-Semantic Embedding (CVSE). CVSE leverages a graph that encodes the co-occurrence of concepts, which it takes to represent a consensus between the textual and visual modalities that captures commonsense knowledge. Experimental results demonstrate an improvement in image-text matching when TCE is used to expand both the background collection and the query. Query expansion, which is not possible in the original CVSE, is particularly helpful. In the future, TCE can be extended to make use of data that is similar to the target domain but is drawn from an additional, external data set.
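
The expansion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_caption`, the `threshold` and `max_concepts` parameters, and the example scores are all hypothetical, and the multi-label classifier is replaced by a hand-written dictionary of per-concept scores. The sketch only shows how predicted concepts might be appended to a caption.

```python
from typing import Dict


def expand_caption(
    caption: str,
    concept_scores: Dict[str, float],
    threshold: float = 0.5,   # assumed cutoff, not taken from the paper
    max_concepts: int = 5,    # assumed cap, not taken from the paper
) -> str:
    """Append high-scoring concepts that the caption does not already contain."""
    caption_tokens = set(caption.lower().split())
    # Rank concepts from highest to lowest predicted score.
    ranked = sorted(concept_scores.items(), key=lambda item: item[1], reverse=True)
    selected = [
        concept
        for concept, score in ranked
        if score >= threshold and concept.lower() not in caption_tokens
    ][:max_concepts]
    return " ".join([caption] + selected)


# Hypothetical scores standing in for the output of a multi-label classifier.
scores = {"surfboard": 0.9, "ocean": 0.7, "wave": 0.95, "dog": 0.1}
expanded = expand_caption("a man rides a wave", scores)
print(expanded)
```

In this sketch the same function could be applied to a query caption as well as to captions in the background collection, which corresponds to the two uses of expansion mentioned in the abstract.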
