MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: Music source separation and pitch estimation are two vital tasks in music information retrieval. Typically, the input of pitch estimation is obtained from the output of music source separation. Existing methods have therefore tried to perform the two tasks simultaneously, so as to leverage the mutually beneficial relationship between them. However, these methods still face two critical challenges that limit the improvement of both tasks: the lack of labeled data and the optimization of joint learning. To address these challenges, we propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL is a generic framework that can use various models for each task. It includes a two-stage training method and a dynamic weighting method named Dynamic Weights on Hard Samples (DWHS), which address the lack of labeled data and joint learning optimization, respectively. Experimental results on public music datasets show that MAJL outperforms state-of-the-art methods on both tasks, with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch estimation. Furthermore, comprehensive studies not only validate the effectiveness of each component of MAJL, but also indicate that MAJL generalizes well across different model architectures.
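For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of their common textbook definitions, not the evaluation code used in the paper. It assumes the plain energy-ratio form of SDR (in dB) and the usual 50-cent tolerance for RPA over frames where the reference is voiced; the function names `sdr_db` and `raw_pitch_accuracy` and the convention that 0 Hz marks an unvoiced frame are illustrative assumptions.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-10):
    """Plain Signal-to-Distortion Ratio in dB (assumed energy-ratio form)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))


def raw_pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of reference-voiced frames whose estimate is within the tolerance.

    Assumption: a frequency of 0 Hz marks an unvoiced frame.
    """
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)
    voiced = ref_hz > 0
    if not np.any(voiced):
        return 0.0
    ref_v = ref_hz[voiced]
    est_v = est_hz[voiced]
    # Frames with no estimate (0 Hz) count as incorrect.
    correct = np.zeros(ref_v.shape, dtype=bool)
    has_est = est_v > 0
    cents = 1200.0 * np.abs(np.log2(est_v[has_est] / ref_v[has_est]))
    correct[has_est] = cents <= tolerance_cents
    return float(np.mean(correct))
```

Under these definitions, an SDR gain is an additive difference in dB, while RPA is a ratio in [0, 1] that is usually reported as a percentage.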
Relevance To Conference: This work proposes a Model-Agnostic Joint Learning (MAJL) framework for both the Music Source Separation (MSS) and Pitch Estimation (PE) tasks. The MAJL framework advances multimedia processing by jointly addressing music source separation and pitch estimation. MSS extracts isolated sources from a music mixture, aiding multimedia content understanding. PE predicts pitch values, which correspond to notes and are crucial for music analysis. These contributions of MAJL enhance music understanding, aligning with multimedia interpretation. The extracted pitches and sources benefit singing synthesis and music generation, highlighting the multimodal impact of the work. This work aligns with evolving music production and computational audio analysis, advancing multimedia processing and creative applications.
Supplementary Material: zip
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Media Interpretation, [Experience] Art and Culture, [Experience] Multimedia Applications
Submission Number: 2080