TL;DR: Modality collapse arises from cross-modal polysemantic entanglements caused by rank bottlenecks in deep multimodal models, and can therefore be remedied by freeing up those bottlenecks.
Abstract: We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
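To make the rank-bottleneck framing concrete, here is a minimal, self-contained sketch (not the paper's algorithm) of how one might probe for modality collapse in practice: it estimates the effective rank of a batch of fusion-head outputs and each modality's contribution to them via a simple zero-ablation. The toy data, the zeroing-based ablation scheme, and the modality names are illustrative assumptions.

```python
# Illustrative diagnostic sketch (assumptions noted above; not the paper's method).
# Given fusion-head activations collected with both modalities present and with each
# modality ablated, it estimates (i) the effective rank of the fused representation
# and (ii) how much each modality contributes to it. A low effective rank together
# with a near-zero contribution from one modality is the kind of signature the paper
# attributes to rank-bottleneck-induced modality collapse.
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (n_samples, dim) feature matrix."""
    centered = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / (s.sum() + eps)                      # normalized singular-value spectrum
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))                # exp(entropy) acts as an effective rank

def modality_contribution(fused: np.ndarray, fused_without_m: np.ndarray) -> float:
    """Relative change in the fused representation when modality m is zeroed out."""
    delta = np.linalg.norm(fused - fused_without_m)
    return float(delta / (np.linalg.norm(fused) + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, bottleneck = 512, 64, 8
    # Toy stand-ins for fusion-head outputs; in practice these would be collected
    # from a trained model with full inputs and with one modality zeroed at the input.
    low_rank_basis = rng.normal(size=(bottleneck, d))
    fused_full = rng.normal(size=(n, bottleneck)) @ low_rank_basis   # rank-bottlenecked
    fused_no_audio = fused_full + 0.01 * rng.normal(size=(n, d))     # audio barely matters
    fused_no_video = rng.normal(size=(n, d))                         # video dominates

    print("effective rank:", effective_rank(fused_full))
    print("audio contribution:", modality_contribution(fused_full, fused_no_audio))
    print("video contribution:", modality_contribution(fused_full, fused_no_video))
```

Under the paper's account, freeing the bottleneck (e.g., via cross-modal distillation or explicit basis reallocation) would be expected to raise both the effective rank and the collapsed modality's contribution.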
Lay Summary: Even though multimodal models may be fed inputs from various data sources (modalities), it has recently been observed that they might not actually be utilizing all of them, a phenomenon called modality collapse. This is wasteful in several ways: unused data, unused model parameters, and redundant computation, all of which add up to high overall costs for little to no return. It thus becomes important to understand the root causes of modality collapse, so that we may find ways to mitigate it.
We show that modality collapse happens when noisy features from one modality get jointly encoded with predictive features from another, through a shared set of neurons. It is like cooking a soup where some ingredients are fresh while others have gone bad. Putting them all together in the same bowl makes the whole bowl of soup inedible! We show that knowledge distillation implicitly separates out those bad ingredients (noisy features) by utilizing previously empty, unused bowls from the cupboard (freeing up rank bottlenecks), prior to putting the soup mix together (multimodal fusion). Based on this, we propose a technique called Explicit Basis Reallocation, which lays the empty bowls (unused dimensions in the representation space) out on the kitchen table, so that the cook (SGD) does not have to look for them in the cupboard while cooking (optimization). With this, we were able to mitigate modality collapse, leading to significant performance improvements in scenarios where certain modalities could randomly go missing during inference.
Link To Code: https://abhrac.github.io/mmcollapse/
Primary Area: Theory->Domain Adaptation and Transfer Learning
Keywords: Multimodal learning, modality collapse
Submission Number: 421