Scoping Review on Image-Text Multimodal Machine Learning Models

TMLR Paper2199 Authors

14 Feb 2024 (modified: 16 Feb 2024), Withdrawn by Authors
Abstract: Multimodal machine learning (MMML) has emerged as a promising field that jointly uses data from several modalities to improve performance and address difficult real-world problems. Large-scale multimodal datasets and the availability of powerful computing resources have accelerated the development of sophisticated deep learning architectures designed for multimodal data. In this paper, we conduct a systematic literature review of the deep learning architectures used in MMML that combine image and text modalities. Our objectives are to survey the models and deep learning architectures used in MMML, to examine the fusion techniques used to combine the two modalities, and to analyze the performance and limitations of these models. To this end, we gathered 341 research articles from five digital library databases; after an extensive review process, 88 research papers remained, which allow us to assess MMML thoroughly. Our findings from these papers point to new directions for further study in this evolving, interdisciplinary domain.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Antoni_B._Chan1
Submission Number: 2199