Abstract: Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively emphasizing modalities based on the characteristics of the input data. However, through a series of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely used self-attention models diminishes: the model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and show that restoring cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.
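To make the query-rotation idea in the abstract concrete, below is a minimal sketch, assuming a two-modality attention setup. The function name `rolling_q_step`, the interpolation-based rotation, and the `rho` parameter are illustrative assumptions for exposition, not the paper's exact algorithm; see the linked repository for the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def rolling_q_step(q, k_a, k_b, rho=0.1):
    """Hypothetical illustration: rebalance attention by nudging the query
    toward the key centroid of the under-attended modality.

    q:   (B, D)     query vectors
    k_a: (B, Na, D) keys from modality A
    k_b: (B, Nb, D) keys from modality B
    rho: rotation strength (assumed hyperparameter)
    """
    # How much attention mass each modality currently receives.
    s_a = torch.einsum('bd,bnd->bn', q, k_a)
    s_b = torch.einsum('bd,bnd->bn', q, k_b)
    attn = torch.cat([s_a, s_b], dim=-1).softmax(dim=-1)
    mass_a = attn[:, : k_a.shape[1]].sum(dim=-1)        # (B,)

    # Key centroids per modality (the "key distribution" in the abstract).
    mu_a = F.normalize(k_a.mean(dim=1), dim=-1)         # (B, D)
    mu_b = F.normalize(k_b.mean(dim=1), dim=-1)

    # Rotate the query toward whichever modality is under-attended,
    # implemented here as norm-preserving renormalized interpolation.
    target = torch.where((mass_a < 0.5).unsqueeze(-1), mu_a, mu_b)
    q_norm = q.norm(dim=-1, keepdim=True)
    q_dir = F.normalize((1 - rho) * F.normalize(q, dim=-1) + rho * target, dim=-1)
    return q_dir * q_norm
```

Under this reading, repeatedly applying such a step keeps attention from collapsing onto one modality's keys, which is the self-reinforcing cycle the abstract describes.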
Lay Summary: Multimodal learning focuses on combining information from different types of data, such as images, text, or sound. A major challenge is making sure the system can effectively handle and combine these different types, especially when the quality of the data varies. Many models, such as Transformers with attention mechanisms, try to tackle this by adaptively emphasizing modalities based on the characteristics of the input data.
However, we discovered that these models can get stuck in a pattern where they start favoring one type of data too much, regardless of the characteristics of the input. This creates a "self-reinforcing cycle" in which the model progressively overemphasizes the favored modality, making it less adaptable and limiting its ability to handle different inputs.
To fix this issue, we propose a new approach called Rolling Query (RollingQ). This method balances the model's attention across different types of data by rotating its focus, breaking the self-reinforcing cycle and ensuring the system works better overall. We tested RollingQ across various scenarios and found that it significantly improves the model's performance and its ability to combine different data types effectively.
Link To Code: https://github.com/GeWu-Lab/RollingQ_ICML2025
Primary Area: Deep Learning
Keywords: Multimodal Learning, Dynamic Fusion, Modality Imbalance
Submission Number: 5820