Highlights
• We study modality synergy and temporal context in multi-modal action recognition.
• We propose M-Mixer, which employs MCU to model cross-modal temporal relationships (see the sketch below).
• MCU captures relations between one modality's sequence and the action content of the other modalities.
• Furthermore, we introduce CFEM and a multi-modal feature bank.
• M-Mixer achieves state-of-the-art performance on three benchmark datasets.
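To make the cross-modal recurrence concrete, the following is a minimal, hypothetical sketch of an MCU-like unit: it updates one modality's temporal state conditioned on an action-content feature pooled from the other modalities. The GRU-style gating, feature dimensions, and fusion-by-concatenation are assumptions for illustration and are not taken from the paper's actual M-Mixer/MCU design.

```python
# Hypothetical sketch of a cross-modal recurrent unit in the spirit of MCU.
# The GRU-based cell and concatenation fusion are assumptions, not the
# paper's specified architecture.
import torch
import torch.nn as nn


class CrossModalRecurrentUnit(nn.Module):
    """Updates one modality's temporal state using a context feature
    summarizing the other modalities (illustrative only)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Each step consumes the current modality feature concatenated
        # with the cross-modal context vector.
        self.cell = nn.GRUCell(feat_dim * 2, hidden_dim)

    def forward(self, seq: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # seq: (T, B, feat_dim) frame features of one modality (e.g., RGB)
        # context: (B, feat_dim) action-content feature from other modalities
        h = seq.new_zeros(seq.size(1), self.cell.hidden_size)
        outputs = []
        for t in range(seq.size(0)):
            x = torch.cat([seq[t], context], dim=-1)
            h = self.cell(x, h)
            outputs.append(h)
        return torch.stack(outputs)  # (T, B, hidden_dim)


# Usage with dummy tensors; dimensions are illustrative only.
if __name__ == "__main__":
    mcu = CrossModalRecurrentUnit(feat_dim=256, hidden_dim=512)
    rgb_seq = torch.randn(16, 4, 256)     # 16 frames, batch of 4
    other_context = torch.randn(4, 256)   # pooled feature from another modality
    fused = mcu(rgb_seq, other_context)
    print(fused.shape)                    # torch.Size([16, 4, 512])
```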