Abstract: Multimodal feature fusion aims to draw
complementary information from different modalities to
achieve better performance. Contrastive learning is effective
at discriminating coexisting semantic features (positive) from
irrelevant ones (negative) in multimodal signals. However, positive
and negative pairs learn at different rates, which undermines the
overall performance of multimodal contrastive learning (MCL).
Moreover, the learned representation model is not robust, as MCL
utilizes supervision signals from potentially noisy modalities. To
address these issues, a novel multimodal contrastive learning
objective, Pace-adaptive and Noise-resistant Noise-Contrastive
Estimation (PN-NCE), is proposed for multimodal fusion by
directly using unimodal features. PN-NCE adaptively encourages
positive and negative pairs to reach their optimal similarity
scores and is less susceptible to noisy inputs during training;
a theoretical analysis of its robustness is provided.
Maximizing modality-invariant information in the
fused representation is expected to benefit the overall performance;
therefore, an estimator that measures the difference between
the fused representation and its unimodal representations is
integrated into MCL to obtain a more modality-invariant
fusion output. The proposed method is model-agnostic and can be
adapted to various multimodal tasks. It also suffers less performance
degradation when the number of training samples is reduced at
the linear probing stage. With different networks and modality
inputs from three multimodal datasets, experimental results show
that PN-NCE achieves consistent improvements over
previous state-of-the-art approaches.
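The exact PN-NCE formulation is not given in this abstract; as background, the standard noise-contrastive (InfoNCE-style) multimodal objective that such methods build on scores aligned cross-modal pairs (the diagonal of a similarity matrix) against all other in-batch pairs. The sketch below is a minimal NumPy illustration under that assumption; the function name `info_nce`, the batch shapes, and the temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Standard InfoNCE loss between two modality embeddings.

    z_a, z_b: (batch, dim) arrays. Row i of each encodes the same
    sample, so diagonal entries of the similarity matrix are the
    positive pairs and off-diagonal entries are negatives.
    """
    # L2-normalize so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature               # (batch, batch)
    # Cross-entropy with the diagonal as the target class
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
text = audio + 0.05 * rng.standard_normal((8, 16))   # aligned modality
loss_aligned = info_nce(audio, text)
loss_random = info_nce(audio, rng.standard_normal((8, 16)))
```

Because every pair shares one temperature and one loss term, positives and negatives are pulled toward their targets at a single global pace, and a corrupted modality directly distorts the supervision signal; these are the two limitations the abstract says PN-NCE addresses with pair-adaptive pacing and noise resistance.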