Vision and Language Integration Meets Multimedia Fusion

IEEE Multim. 2018
Abstract: Multimodal information fusion at both the signal and the semantics level is a core part of most multimedia applications, including indexing, retrieval, and summarization. Prototype systems have implemented early or late fusion of modality-specific processing results through various methodologies, including rule-based approaches, information-theoretic models, and machine learning.1 Vision and language are two of the predominant modalities that are fused, with a long history of results in TRECVid, ImageCLEF, and other international challenges. During the last decade, vision–language semantic integration has attracted attention from traditionally non-interdisciplinary research communities such as computer vision and natural language processing, because one modality can greatly assist in processing another by providing cues for disambiguation, complementary information, and noise/error filtering. Recent advances in deep learning have opened up new opportunities for joint modeling of visual and co-occurring verbal information in multimedia.
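To illustrate the early versus late fusion distinction mentioned in the abstract, the following minimal Python sketch contrasts feature-level concatenation with decision-level score averaging. All names, feature dimensions, weights, and scores here are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Toy modality-specific features for one multimedia item
# (shapes are illustrative assumptions, not from the paper).
rng = np.random.default_rng(0)
visual_feat = rng.normal(size=512)   # e.g., pooled visual descriptor
text_feat = rng.normal(size=300)     # e.g., averaged word embeddings

def early_fusion(v, t):
    """Early (feature-level) fusion: concatenate modality features
    into one joint representation before any classifier is applied."""
    return np.concatenate([v, t])

def late_fusion(score_v, score_t, w=0.5):
    """Late (decision-level) fusion: combine per-modality classifier
    scores, here with a simple weighted average."""
    return w * score_v + (1.0 - w) * score_t

joint = early_fusion(visual_feat, text_feat)            # shape (812,)
fused_score = late_fusion(score_v=0.72, score_t=0.41)   # 0.565
print(joint.shape, fused_score)
```

In practice, the early-fused representation would feed a single downstream model, while late fusion combines the outputs of separately trained per-modality models; the abstract notes that rule-based, information-theoretic, and learned combination schemes have all been used for this step.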