Abstract: Multimodal representation learning, particularly contrastive learning, plays an important role in artificial intelligence. As
an important subfield, video-language representation learning focuses on learning representations from global semantic interactions
between pre-defined video-text pairs. However, refining such coarse-grained global interactions requires more detailed interactions
for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text pairs as game players
using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity,
flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained
correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in the calculation
of the Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components.
This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving
the adaptive encoding characteristics of the cross-modal representation. Additionally, we extend our original structure into a flexible
encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used
text-video retrieval, video question answering, and video captioning benchmarks validate the effectiveness and generalization of our
method, which achieves superior performance. The code is available at https://github.com/jpthu17/HBI.
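To make the core quantity concrete, below is a minimal Python sketch (not the authors' implementation) of the Banzhaf interaction index that the Hierarchical Banzhaf Interaction builds on. The toy characteristic function v, defined here from pairwise cosine similarities among a handful of "players" (e.g., video clips and words), is purely illustrative.

```python
import itertools
import numpy as np

def banzhaf_interaction(v, n, i, j):
    """Banzhaf interaction index I([i, j]) for players i and j.

    v: characteristic function mapping a frozenset of players to a scalar payoff.
    n: total number of players (here, video clips and text words treated as players).
    Averages the synergy v(S|{i,j}) - v(S|{i}) - v(S|{j}) + v(S)
    uniformly over all coalitions S that exclude i and j.
    """
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for r in range(len(others) + 1):
        for S in itertools.combinations(others, r):
            S = frozenset(S)
            total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / (2 ** len(others))

# Toy setup: payoff of a coalition = sum of pairwise cosine similarities inside it.
feats = np.random.randn(4, 8)                      # 4 players (e.g., 2 clips + 2 words)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = feats @ feats.T

def v(S):
    members = list(S)
    return sum(sim[a, b] for a, b in itertools.combinations(members, 2))

print(banzhaf_interaction(v, n=4, i=0, j=2))       # synergy between players 0 and 2
```

A higher index indicates that the clip-word pair contributes more jointly than separately, which is the fine-grained correspondence signal the method exploits; the exhaustive enumeration here is exponential in the number of players and is only meant to convey the definition.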