RevNet: A Review Network with Group Aggregation Fusion for Singing Melody Extraction

Shuai Yu, Xiaoliang He, Yanting Zhang

Published: 01 Jan 2024, Last Modified: 11 Apr 2025, ICME 2024, CC BY-SA 4.0

Abstract: Singing melody extraction (SME) is a critical task in music information retrieval (MIR). Recently, deep-learning-based methods have achieved remarkable success in singing melody extraction. However, most existing models rely on stacked convolution layers to progressively obtain task-specific features. Such an architecture has two limitations: 1) during training, once the global semantic feature is obtained, it is used directly to make predictions, so SME models lack a process for learning in depth from their training errors; 2) semantic gaps exist between features from different levels in prior SME models, and these inconsistent features may lead to suboptimal performance. To address these limitations, we propose a review network (RevNet) with group aggregation fusion for singing melody extraction. Specifically, the proposed network is based on an encoder-decoder network and consists of two modules: a review module and a group aggregation fusion (GAF) module. The review module enables the model to learn from training errors interactively. We design multiple review modules to iteratively review the training errors; the design resembles a review process that forces the model to learn from its prior prediction errors. The GAF module fuses features from different levels and makes multi-level features complementary. A set of dilated convolution operations is performed on the grouped high-level and low-level features that we design. Moreover, to explicitly eliminate the difference between multi-level features, the feature maps from different levels are directly supervised by the ground truth. We conduct experiments on several public datasets, and the promising results demonstrate the effectiveness of the proposed method.
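The abstract does not specify how the GAF module groups or fuses features, so the following is only a minimal illustrative sketch of the general idea of applying dilated convolutions to grouped high-level and low-level features. Everything in it is an assumption made for illustration: the function names (`dilated_conv1d`, `split_groups`, `group_aggregation_fusion`), the 1D feature layout, the pairing of the i-th low-level group with the i-th high-level group, the channel averaging, and the per-group dilation rates. It should not be read as the authors' implementation.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (cross-correlation form)."""
    center = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` samples apart around position t.
            idx = t + (j - center) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def split_groups(channels, num_groups):
    """Split a list of channels into `num_groups` equally sized groups."""
    size = len(channels) // num_groups
    return [channels[i * size:(i + 1) * size] for i in range(num_groups)]


def group_aggregation_fusion(low, high, num_groups, kernel, dilations):
    """Hypothetical fusion: pair low/high groups, average them, apply a dilated conv.

    `low` and `high` are lists of channels; each channel is a list of floats
    of the same length. One fused channel is returned per group.
    """
    low_groups = split_groups(low, num_groups)
    high_groups = split_groups(high, num_groups)
    fused = []
    for g in range(num_groups):
        # Assumption: the g-th low-level and high-level groups are mixed together.
        mixed = low_groups[g] + high_groups[g]
        length = len(mixed[0])
        averaged = [sum(ch[t] for ch in mixed) / len(mixed) for t in range(length)]
        # Assumption: each group uses its own dilation rate.
        fused.append(dilated_conv1d(averaged, kernel, dilations[g]))
    return fused
```

In this sketch, each group sees a different receptive field through its dilation rate, which is one plausible way to let features from different levels complement each other. The direct supervision of multi-level feature maps described in the abstract is not modeled here.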