Abstract: With the widespread availability of multiple data sources, such as image, audio-video, and text data, automatic summarization of multimodal data is becoming an important technology for decision support. This paper presents a comprehensive survey of the main recent articles on multimodal summarization techniques. First, we define multimodal summarization and briefly describe its development. Then, we survey existing techniques and their applicability in different domains. We also analyze their results and discuss the insights these approaches offer, along with open challenges and future research directions. Based on our study, we find that the encoder-decoder approach is currently the most effective approach for automated summarization. We expect applications of multimodal summarization to develop rapidly in many fields, particularly in medicine. In our case studies, we demonstrate that multimodal learning is a promising research direction for providing more timely and accurate summaries than unimodal approaches.