MAVAR-SE: Multi-scale Audio-Visual Association Representation Network for End-to-End Speaker Extraction

Shilong Yu, Chenhui Yang

Published: 2024 · Last Modified: 02 Apr 2026 · MMM (2) 2024 · License: CC BY-SA 4.0
Abstract: Speaker extraction, which separates a target speaker's speech from mixed audio, is an important problem in the speech separation field. Since human pronunciation is closely tied to lip motions and facial expressions during speaking, this paper focuses on lip motions and their relationship to pronunciation, and proposes a multi-scale audio-visual association representation network for end-to-end speaker extraction (MAVAR-SE). Moreover, multi-scale feature extraction and skip connections are used to mitigate the information loss caused by convolution's lack of long-range memory. The method is not limited by the number of speakers in the mixture and requires no prior knowledge, such as the speech characteristics of the target speaker, thus realizing speaker-independent, multi-modal, time-domain speaker extraction. Compared with other recent methods on the VoxCeleb2 and LRS2 datasets, the proposed method shows better results and robustness.
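The abstract gives no implementation details, so the following PyTorch sketch is only an illustration of the general idea it names: parallel dilated 1-D convolutions extract features at multiple temporal scales, and a skip connection adds the block input back so fine-grained information is not lost across layers. The class name, channel size, and dilation rates are assumptions for the example, not the paper's actual design.

```python
import torch
import torch.nn as nn


class MultiScaleConvBlock(nn.Module):
    """Hypothetical multi-scale 1-D convolution block with a skip connection.

    Parallel dilated convolutions capture context at several temporal
    scales; the input is added back (skip connection) so that information
    is preserved across stacked layers.
    """

    def __init__(self, channels: int = 256, kernel_size: int = 3,
                 dilations=(1, 2, 4)):
        super().__init__()
        # One branch per scale; "same" padding keeps the time length fixed.
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        ])
        # 1x1 convolution fuses the concatenated scales back to `channels`.
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Skip connection: add the block input to the fused multi-scale output.
        return x + self.act(self.fuse(multi_scale))


if __name__ == "__main__":
    block = MultiScaleConvBlock()
    mixture_features = torch.randn(2, 256, 1000)  # (batch, channels, time)
    out = block(mixture_features)
    print(out.shape)  # torch.Size([2, 256, 1000])
```

In a time-domain extraction pipeline of this kind, blocks like this would typically be stacked on the encoded mixture, with the visual (lip-motion) stream fused in at one or more of these stages; the exact fusion scheme is specific to the paper and is not reproduced here.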