Abstract: Multimodal sequence learning aims to utilize
information from different modalities to enhance overall
performance. Mainstream works often follow an intermediate-fusion pipeline, which explores both modality-specific and
modality-supplementary information for fusion. However, the
unaligned and heterogeneously distributed multimodal sequences
pose significant challenges to the fusion task: 1) extracting both effective unimodal and crossmodal representations, and 2) overcoming the overfitting issue in joint multimodal sequence
optimization. In this work, we propose regularized expressive representation distillation (RERD), which seeks effective multimodal representations and enhances the generalization of fusion. First, to improve unimodal representation learning, the unimodal representations are assigned to multi-head distillation encoders, where they are iteratively updated through distillation attention layers.
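The abstract does not spell out the layer design, so the following is only a minimal sketch of one plausible multi-head distillation encoder: the class name, dimensions, layer count, and the use of standard self-attention with residual updates are all assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of a multi-head distillation encoder (names/shapes assumed).
import torch
import torch.nn as nn

class DistillationEncoder(nn.Module):
    """Iteratively refines one modality's sequence representation
    with stacked multi-head attention layers."""
    def __init__(self, dim=64, heads=4, layers=3):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        for attn, norm in zip(self.attn_layers, self.norms):
            refined, _ = attn(x, x, x)         # one distillation-attention pass
            x = norm(x + refined)              # residual update of the representation
        return x

# Usage: one encoder per modality (e.g., text / audio / vision).
text_feats = torch.randn(8, 50, 64)
refined_text = DistillationEncoder()(text_feats)   # (8, 50, 64)
```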
Second, to alleviate the overfitting issue in joint crossmodal optimization, a multimodal Sinkhorn distance regularizer is proposed to reinforce expressive representation extraction and to adaptively reduce the modality gap before fusion.
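As a rough illustration of such a regularizer, a standard entropy-regularized Sinkhorn distance between two modalities' representations can be sketched as below; the cost rescaling, epsilon, iteration count, and the modality pairing are illustrative assumptions rather than the paper's formulation.

```python
# Hedged sketch of a Sinkhorn-distance regularizer between two modalities.
import torch

def sinkhorn_distance(x, y, eps=0.1, iters=50):
    """Entropy-regularized optimal-transport cost between two sets of
    representations x: (n, d) and y: (m, d), with uniform marginals."""
    cost = torch.cdist(x, y, p=2) ** 2          # pairwise squared-Euclidean costs
    cost = cost / cost.max()                    # rescale for numerical stability
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u = torch.ones_like(a)
    for _ in range(iters):                      # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u + 1e-8)
        u = a / (K @ v + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # optimal transport plan
    return (plan * cost).sum()

# Used as a penalty that pulls modality distributions together before fusion;
# the weighting and modality pairing here are placeholders.
text_repr, audio_repr = torch.randn(32, 64), torch.randn(32, 64)
reg_loss = sinkhorn_distance(text_repr, audio_repr)
```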
These representations provide a comprehensive view of the multimodal sequences and are utilized for downstream fusion tasks. Experimental results on
several popular benchmarks demonstrate that the proposed method achieves state-of-the-art performance compared with widely used baselines for deep multimodal sequence fusion. Code is available at https://github.com/Redaimao/RERD.