Highlights
•Construct modality-shared and modality-specific encoders that effectively learn shared and specific feature representations of modalities.
•Propose an end-to-end multimodal representation and fusion method for multimodal intent recognition.
•Propose an adaptive multimodal fusion method based on an attention-based gated neural network, which can distinguish the contributions of different modalities and reduce possible noise (see the sketch after this list).
•Experimental results show that the model outperforms state-of-the-art models on multiple evaluation metrics.
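The highlights do not spell out the fusion mechanism, so the following is only a minimal sketch of what an attention-based gated fusion over modality features might look like; the module name AttentionGatedFusion, the gating-network shape, the 256-dimensional features, and the three modalities (text, video, audio) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: attention-gated fusion of modality embeddings (not the paper's code).
import torch
import torch.nn as nn


class AttentionGatedFusion(nn.Module):
    """Weight each modality by a learned attention gate before fusing.

    A small gating network scores every modality embedding; the
    softmax-normalized scores weight the modalities prior to summation,
    so less informative or noisier modalities can receive smaller weights.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.Tanh(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, dim)
        scores = self.gate(feats)                 # (batch, num_modalities, 1)
        weights = torch.softmax(scores, dim=1)    # attention over modalities
        fused = (weights * feats).sum(dim=1)      # (batch, dim)
        return fused


if __name__ == "__main__":
    # Assume text/video/audio features already projected to a shared 256-d space.
    fusion = AttentionGatedFusion(dim=256)
    text, video, audio = (torch.randn(8, 256) for _ in range(3))
    fused = fusion(torch.stack([text, video, audio], dim=1))
    print(fused.shape)  # torch.Size([8, 256])
```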