Abstract: Deep network architectures are usually built on domain-specific assumptions and specialized to the modalities under consideration. The same holds for multi-modal networks, which typically rely on modality-specific subnetworks. In this paper, we introduce a novel dynamic multi-modal and multi-instance (MM-MI) network based on Perceiver and Hopfield pooling that can learn intrinsic data fusion. We further introduce a novel composite dataset for evaluating MM-MI problems. We show that our proposed architecture outperforms the late-fusion baseline in all multi-modal setups by more than 40% accuracy on noisy data. Our simple, generally applicable, yet efficient architecture is a generalized approach to data fusion with high potential for future applications.