Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?

Published: 01 Jan 2024 · Last Modified: 17 Jul 2025 · ICMI 2024 · CC BY-SA 4.0
Abstract: Multi-modal learning has emerged as an increasingly promising avenue in visual recognition, driving innovations across diverse domains. Despite its success, the robustness of multi-modal learning for visual recognition is often challenged by the unavailability of a subset of modalities, especially the visual modality. Conventional approaches to mitigating missing modalities in multi-modal learning rely heavily on modality fusion schemes. In contrast, this paper explores the use of text-to-image models to assist multi-modal learning. Specifically, we propose and explore a simple yet effective multi-modal learning framework, GTI-MM, which enhances data efficiency and model robustness against the missing visual modality by imputing the missing data with generative models. Using multiple multi-modal datasets with visual recognition tasks, we present a comprehensive analysis of diverse conditions involving missing visual data. Our findings show that synthetic images improve training data efficiency when visual data are missing during training and improve model robustness when visual data are missing during both training and testing. Moreover, we demonstrate that GTI-MM is effective with a low generation quantity and simple prompt techniques. Our code base and synthetic images are available at https://github.com/usc-sail/GTI-MM.
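The core idea of the abstract, imputing a missing image from a sample's available text with an off-the-shelf text-to-image model, can be sketched as follows. This is a minimal illustration rather than the authors' released pipeline (see the GitHub repository above); the Stable Diffusion checkpoint, the prompt template, and the `impute_missing_images` helper are all assumptions made for illustration.

```python
# Minimal sketch: impute a missing visual modality with a text-to-image model.
# Assumptions (not from the paper): Stable Diffusion v1.5 via Hugging Face
# diffusers, and a simple "a photo of <text>" prompt template.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def impute_missing_images(samples):
    """For each sample lacking an image, generate one from its text field."""
    for sample in samples:
        if sample.get("image") is None:  # visual modality is missing
            prompt = f"a photo of {sample['text']}"  # simple prompt technique
            sample["image"] = pipe(prompt, num_inference_steps=30).images[0]
    return samples

# Example: a training batch where one sample is missing its image.
batch = [
    {"text": "a golden retriever playing in the park", "image": None},
    {"text": "a red sports car", "image": "existing_image.jpg"},
]
batch = impute_missing_images(batch)
```

The imputed images can then be fed to the downstream multi-modal recognition model in place of the missing visual inputs, during training, testing, or both.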