Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation

Published: 10 Oct 2024, Last Modified: 30 Oct 2024
Venue: Audio Imagination: NeurIPS 2024 Workshop
License: CC BY 4.0
Keywords: Disentangled Representation Learning; Autoencoders; Diffusion Models; Audio Transformation
TL;DR: We propose a framework to disentangle pitch and timbre of musical instruments from mixtures, first implementing a simple autoencoder for source-level attribute manipulation, and scaling up to complex datasets with a diffusion transformer.
Abstract: Disentangling pitch and timbre from the audio of a musical instrument involves encoding these two attributes as separate latent representations, allowing the synthesis of instrument sounds with novel attribute combinations by manipulating one representation independently of the other. Existing solutions have mostly focused on single-instrument audio and do not address the case where multiple instrument sources are present. To fill this gap, we aim to disentangle multi-instrument mixtures by extracting a per-instrument representation that combines the pitch and timbre latent variables. These latent variables form a set of modular building blocks used to condition a decoder to compose new mixtures. We first present a simple implementation to verify the framework using structured and isolated chords. We then scale up to a complex dataset of four-part chorales using a model that jointly learns the latents and a diffusion transformer. Our evaluation identifies the key components for the success of disentanglement and demonstrates the application of mixture transformation based on source-level attribute manipulation. Audio samples are available at https://yjlolo.github.io/dismix-audio-samples.
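To make the latent structure concrete, below is a minimal, hypothetical PyTorch sketch of the framework the abstract describes: each source is encoded into separate pitch and timbre latents, and a decoder composes a mixture from the per-source pairs. All module names, dimensions, and the simple MLP decoder are illustrative assumptions, not the paper's actual architecture (which includes a diffusion transformer variant).

```python
# Illustrative sketch only; dimensions, modules, and the additive mixing
# assumption are placeholders, not the authors' implementation.
import torch
import torch.nn as nn


class SourceEncoder(nn.Module):
    """Encodes one instrument source into separate pitch and timbre latents."""

    def __init__(self, in_dim=128, pitch_dim=16, timbre_dim=16):
        super().__init__()
        self.pitch_enc = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, pitch_dim))
        self.timbre_enc = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, timbre_dim))

    def forward(self, x):
        return self.pitch_enc(x), self.timbre_enc(x)


class MixtureDecoder(nn.Module):
    """Composes a mixture from a set of per-source (pitch, timbre) latents."""

    def __init__(self, pitch_dim=16, timbre_dim=16, out_dim=128):
        super().__init__()
        self.dec = nn.Sequential(
            nn.Linear(pitch_dim + timbre_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, latents):
        # Each (pitch, timbre) pair acts as a modular building block;
        # decoded sources are summed to compose the output mixture.
        return torch.stack(
            [self.dec(torch.cat(pair, dim=-1)) for pair in latents]).sum(dim=0)


# Usage: encode two sources, swap their timbre latents, and decode to
# synthesize a mixture with a novel pitch-timbre combination.
enc, dec = SourceEncoder(), MixtureDecoder()
src_a, src_b = torch.randn(1, 128), torch.randn(1, 128)  # stand-in features
(p_a, t_a), (p_b, t_b) = enc(src_a), enc(src_b)
swapped_mix = dec([(p_a, t_b), (p_b, t_a)])  # timbre-swapped recomposition
```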
Submission Number: 12