Abstract: Speaker extraction aims to selectively extract the target speaker from a multi-talker environment under the guidance of an auxiliary reference. Recent studies have shown that information about the attended speaker can be decoded from the listener's brain activity via auditory attention decoding. However, how to effectively exploit the target-speaker information shared between electroencephalography (EEG) and speech remains an open problem. In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. To make full use of the speech information, the mixed speech is encoded at multiple time scales to obtain multi-scale embeddings. In addition, to effectively model the non-Euclidean structure of EEG data, graph convolutional networks are used as the EEG encoder. Finally, each of these multi-scale embeddings is separately fused with the EEG features. To facilitate research on auditory attention decoding and further validate the effectiveness of the proposed method, we also construct the AVED dataset, a new EEG-Audio dataset. Experimental results on both the public Cocktail Party dataset and the newly proposed AVED dataset show that our MSFNet model significantly outperforms the state-of-the-art method on certain objective evaluation metrics.
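The abstract describes an architecture with a multi-scale speech encoder, a graph-convolutional EEG encoder, and per-scale fusion. The following is a minimal sketch of that idea, not the authors' implementation: all layer sizes, kernel lengths, the learnable adjacency over EEG channels, and the class names are illustrative assumptions.

```python
# Minimal sketch of multi-scale speech encoding + graph-conv EEG encoding + fusion.
# Layer sizes, kernel lengths, and the learnable adjacency are assumptions, not MSFNet.
import torch
import torch.nn as nn


class SimpleGraphConv(nn.Module):
    """One graph-convolution layer over EEG channels: X' = ReLU(A @ X @ W)."""

    def __init__(self, in_dim, out_dim, num_nodes):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # Learnable adjacency (assumption; a fixed electrode-topology graph could be used instead).
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))

    def forward(self, x):              # x: (batch, channels, in_dim)
        x = torch.matmul(self.adj, x)  # aggregate features across channel graph
        return torch.relu(self.linear(x))


class MultiScaleFusionSketch(nn.Module):
    def __init__(self, eeg_channels=64, eeg_dim=128, feat_dim=128,
                 kernel_sizes=(20, 40, 80)):
        super().__init__()
        # One 1-D conv encoder per time scale (different kernel lengths).
        self.speech_encoders = nn.ModuleList(
            nn.Conv1d(1, feat_dim, k, stride=k // 2) for k in kernel_sizes
        )
        self.eeg_encoder = SimpleGraphConv(eeg_dim, feat_dim, eeg_channels)
        # One fusion layer per scale, combining speech and EEG embeddings.
        self.fusers = nn.ModuleList(
            nn.Linear(2 * feat_dim, feat_dim) for _ in kernel_sizes
        )

    def forward(self, mixture, eeg):
        # mixture: (batch, samples); eeg: (batch, channels, eeg_dim)
        eeg_emb = self.eeg_encoder(eeg).mean(dim=1)        # (batch, feat_dim)
        fused_scales = []
        for enc, fuse in zip(self.speech_encoders, self.fusers):
            s = enc(mixture.unsqueeze(1))                  # (batch, feat_dim, frames)
            e = eeg_emb.unsqueeze(-1).expand_as(s)         # broadcast EEG over frames
            f = fuse(torch.cat([s, e], dim=1).transpose(1, 2))
            fused_scales.append(f.transpose(1, 2))
        # Downstream steps (mask estimation, decoding, scale aggregation) are omitted.
        return fused_scales


if __name__ == "__main__":
    model = MultiScaleFusionSketch()
    mix = torch.randn(2, 16000)      # 1 s of 16 kHz mixture (assumption)
    eeg = torch.randn(2, 64, 128)    # 64 EEG channels x 128 temporal features (assumption)
    print([o.shape for o in model(mix, eeg)])
```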
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Recent research in neuroscience has shown that a listener's auditory attention can be decoded from recorded brain activity. In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which employs an end-to-end architecture to comprehensively extract multimodal fusion features from both electroencephalography (EEG) and speech. Without any pre-registered prior information about the identity of the target speaker, the attended speech is directly extracted based on the attention information decoded from the listener's brain signals. In addition, we construct a new Audio-Video EEG dataset, named the AVED dataset, which incorporates visual information and is intended to advance research in areas such as multimodal auditory attention decoding and brain-controlled speaker extraction.
Submission Number: 4596