Will a Blind Model Hear Better? Advanced Audiovisual Recognition System with Brain-Like Compensating and Gating
Keywords: brain-inspired computing, multi-modal neural network, audio-visual speech recognition
Abstract: Multi-modal fusion neural networks (e.g., for audio-visual inputs or various medical images) have drawn increasing attention recently, with a growing number of models and training techniques being proposed. Despite the success of multi-modal fusion networks, we observe an interesting "low single-modality robustness" phenomenon: a multi-modal trained model may perform worse than a single-modal trained model when the other modality is masked. This is analogous to a person born blind or deaf (single-modal trained) surpassing a healthy person (multi-modal trained) when only one modality is available; the multi-modal experience becomes a bias that causes negative transfer. This shows that existing neural networks are less robust than the human brain to missing modalities. To overcome this defect, we design a brain-like neural network that models the processing of audio and visual signals and train it to perform audiovisual speech recognition tasks. Our results demonstrate the computational model's vulnerability to sensory deprivation and show that promoting adaptation to such deprivation helps multi-modal processing. In addition, we propose modality mix and gated fusion techniques to obtain a more robust model with better generalization ability. We call for more attention to the interaction of signals from different modalities and hope our work will inspire more researchers to study cross-modal complementarity.
One-sentence Summary: We find that multi-modal models fail to generalize better to single-modal tasks, with or without transfer learning, and we train a more robust multi-modal neural network using brain-inspired compensating and gating.
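The sketch below illustrates the two ideas named in the abstract, modality mix and gated fusion, in PyTorch. It is a minimal illustration under assumptions, not the authors' architecture: the module names (GatedFusion, modality_mix), the feature dimension, and the drop probability are hypothetical choices for exposition.

```python
# Illustrative sketch (not the authors' code): gated audio-visual fusion plus
# random modality masking ("modality mix") to improve single-modality robustness.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse audio and visual features with a learned sigmoid gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per feature, how much each modality contributes.
        g = self.gate(torch.cat([audio_feat, visual_feat], dim=-1))
        return g * audio_feat + (1.0 - g) * visual_feat


def modality_mix(audio_feat: torch.Tensor, visual_feat: torch.Tensor,
                 p_drop: float = 0.3):
    """With probability p_drop, zero out one modality for the whole batch,
    simulating sensory deprivation so the model learns to compensate."""
    if torch.rand(()) < p_drop:
        if torch.rand(()) < 0.5:
            audio_feat = torch.zeros_like(audio_feat)    # "deaf" batch
        else:
            visual_feat = torch.zeros_like(visual_feat)  # "blind" batch
    return audio_feat, visual_feat


if __name__ == "__main__":
    fusion = GatedFusion(dim=256)
    a = torch.randn(8, 256)   # audio embeddings (batch, dim)
    v = torch.randn(8, 256)   # visual embeddings (batch, dim)
    a, v = modality_mix(a, v)
    fused = fusion(a, v)
    print(fused.shape)        # torch.Size([8, 256])
```

At inference time, masking one modality corresponds to feeding zeros for that stream; a model trained with such masking is less likely to exhibit the "low single-modality robustness" phenomenon described above.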