IMAGINATOR - A Pre-Trained Image+Text Joint Embedding

Anonymous

17 Feb 2023 (modified: 05 May 2023) · ACL ARR 2023 February Blind Submission · Readers: Everyone
Abstract: Word embeddings, i.e., semantically meaningful vector representations of words, are largely influenced by the distributional hypothesis "You shall know a word by the company it keeps", whereas modern prediction-based neural network embeddings rely on design choices and hyperparameter optimisation. Word embeddings such as Word2Vec and GloVe capture contextuality and real-world analogies well, but contemporary convolution-based image embeddings such as VGGNet and AlexNet do not capture contextual knowledge. The popular king-queen analogy does not hold true for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained at the level of 21K distinct image objects drawn from 1M image+text pairs. A JE is a way to encode multimodal data into a vector space where the text modality serves as the grounding key, to which the complementary modality (in this case, the image) is anchored. IMAGINATOR encapsulates three individual representations: (i) object-object collocation, (ii) word-object collocation, and (iii) word-object correlation. These three representations capture complementary aspects/knowledge of the two modalities and are further combined to obtain the final JEs. We evaluate pre-trained IMAGINATOR JEs on three distinct tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR establishes a new standard on the aforementioned downstream tasks by outperforming the current SoTA on all the selected tasks. The generated JEs are also intrinsically evaluated to assess how well they capture contextuality and real-world analogies, based on word analogies and their corresponding images. IMAGINATOR will be made publicly available.
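
For illustration only: the abstract states that the three component representations are combined into the final joint embedding, but does not specify the fusion operator. The minimal Python sketch below assumes simple concatenation followed by L2 normalisation; the function name joint_embedding and the 300-dimensional toy vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def joint_embedding(obj_obj: np.ndarray,
                    word_obj_colloc: np.ndarray,
                    word_obj_corr: np.ndarray) -> np.ndarray:
    """Fuse three component vectors into one joint embedding.

    Hypothetical fusion: the paper's exact combination method is not
    given here, so this sketch concatenates the object-object
    collocation, word-object collocation, and word-object correlation
    vectors and L2-normalises the result.
    """
    je = np.concatenate([obj_obj, word_obj_colloc, word_obj_corr])
    norm = np.linalg.norm(je)
    return je / norm if norm > 0 else je

# Toy usage with random 300-d component vectors (dimensions assumed).
rng = np.random.default_rng(0)
je = joint_embedding(rng.standard_normal(300),
                     rng.standard_normal(300),
                     rng.standard_normal(300))
print(je.shape)  # (900,)
```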
Paper Type: long
Research Area: Resources and Evaluation