Self-Supervised Fine-Grained Cycle-Separation Network (FSCN) for Visual-Audio Separation

Yanli Ji, Shuo Ma, Xing Xu, Xuelong Li, Heng Tao Shen

Published: 2023, Last Modified: 13 Nov 2024IEEE Trans. Multim. 2023EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Audio mixture separation is still challenging due to heavy overlaps and interactions. To correctly separate audio mixtures, we propose a novel self-supervised Fine-grained Cycle-Separation Network (FCSN) for vision-guided audio mixture separation. In the proposed approach, we design a two-stage procedure to perform self-supervised separation on audio mixtures. Using visual information as guidance, a primary-stage separation is realized via a U-net network, then the residual spectrogram is calculated by removing separated spectrograms from the original audio mixture. At the second-stage separation, a cycle-separation module is proposed to refine separation using separated results and the residual spectrogram. Self-supervision learning between vision and audio modalities is presented to push the cycle separation until the residual spectrogram becomes empty. Extensive experiments are evaluated on three large-scale datasets, MUSIC (MUSIC-21), AudioSet, and VGGSound. Experiment results certify that our approach outperforms the state-of-the-art approaches, and demonstrate the effectiveness for separating audio mixtures with overlap and interaction.