Abstract: In multi-modal classification tasks, a good fusion algorithm can effectively integrate and process multi-modal data, thereby significantly improving classification performance. Researchers often focus on designing complex fusion operators and have proposed numerous such operators, while paying less attention to how feature fusion is used, that is, how features should be fused to better facilitate multi-modal classification. In this article, we propose a progressive skip reasoning fusion network (PSRFN) as an attempt to address this issue. First, unlike most existing multi-modal fusion methods, which use a single fusion operator in a single stage to fuse all view features, PSRFN uses a progressive skip reasoning (PSR) block to fuse all views with a fusion operator at each layer. Specifically, each PSR block jointly uses all view features and the fused features from the previous layer to obtain the fused features for the current layer. Second, each PSR block employs a dual-weighted fusion strategy with learnable parameters to adaptively allocate weights during fusion: the first level assigns a weight to each view feature, and the second level assigns weights to the previous layer's fused features and the fused features produced by the first level in the current layer. This strategy allows each PSR block to dynamically adjust the weights according to the actual contribution of each feature. Finally, to enable the model to fully exploit feature information from different levels, skip connections are adopted between PSR blocks. Extensive experimental results on six real multi-modal datasets show that a better usage strategy for fusion operators can indeed improve performance.
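To make the dual-weighted PSR block concrete, the following is a minimal PyTorch sketch under stated assumptions. The abstract only specifies that each block (i) weights every view feature with learnable parameters, (ii) fuses the weighted views, and (iii) weights the result against the previous layer's fused features. The class and parameter names, the softmax normalization, and the weighted-sum fusion operator are all illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class PSRBlock(nn.Module):
    """Hypothetical sketch of one progressive skip reasoning (PSR) block."""

    def __init__(self, num_views: int, dim: int):
        super().__init__()
        # First-level weights: one learnable scalar per view feature.
        self.view_weights = nn.Parameter(torch.ones(num_views))
        # Second-level weights: current fused feature vs. previous layer's.
        self.level_weights = nn.Parameter(torch.ones(2))
        self.proj = nn.Linear(dim, dim)

    def forward(self, views: list[torch.Tensor], prev_fused: torch.Tensor) -> torch.Tensor:
        # First level: weighted sum over all view features, each (B, D).
        w = torch.softmax(self.view_weights, dim=0)
        fused = sum(w[i] * v for i, v in enumerate(views))
        # Second level: blend the first-level result with the previous
        # layer's fused features, then refine with a small projection.
        a = torch.softmax(self.level_weights, dim=0)
        out = a[0] * fused + a[1] * prev_fused
        return torch.relu(self.proj(out))
```

Stacking several such blocks, with skip connections carrying earlier fused features forward between blocks, would give one plausible realization of the progressive design described above.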
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Multi-modal fusion faces a key issue: on which hierarchical features should the fusion operator operate? This paper therefore focuses on how to effectively fuse multi-modal features at the fusion layer. Unlike most existing multi-modal fusion methods, which use a single fusion operator in a single stage to fuse all view features, our method utilizes a progressive skip reasoning block to fuse all views with a fusion operator at each layer.
Submission Number: 3992