One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer

Published: 10 Oct 2024, Last Modified: 31 Oct 2024 · Audio Imagination: NeurIPS 2024 Workshop · CC BY 4.0
Keywords: Neural Audio Synthesis, Diffusion Transformer, Disentangled Representation Learning.
Abstract: Although emerging text-to-music models based on deep generative approaches have succeeded in generating music clips for general audiences, they face significant limitations when applied to professional music production. This paper introduces TaVIG, a one-shot Text-aligned Virtual Instrument Generation model built on a Diffusion Transformer. The model integrates textual descriptions with the timbre information of audio clips to generate musical performances, using additional musical structure features such as pitch, onset, duration, offset, and velocity. TaVIG comprises a CLAP-based text-aligned timbre extractor-encoder, a musical structure encoder that extracts MIDI information, and a disentangled representation learning module that ensures effective separation of timbre and structure. Audio synthesis is performed by a Diffusion Transformer conditioned via AdaLN. In addition, we propose a mathematical framework for analyzing timbre-structure disentanglement in MIDI-to-audio tasks.
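The abstract describes conditioning the Diffusion Transformer with AdaLN. Below is a minimal sketch of a standard AdaLN-modulated DiT block in PyTorch, illustrating how a conditioning vector (e.g., timbre and timestep embeddings) could produce per-block scale, shift, and gate parameters; all names, dimensions, and design choices here are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AdaLNDiTBlock(nn.Module):
    """Illustrative Diffusion Transformer block with AdaLN conditioning (not the paper's code)."""

    def __init__(self, dim: int, num_heads: int, cond_dim: int):
        super().__init__()
        # Layer norms without learnable affine params; modulation comes from the condition.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Single projection emitting shift/scale/gate pairs for both sub-layers.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, cond_dim) -> six (batch, 1, dim) modulation tensors.
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.ada_ln(cond).unsqueeze(1).chunk(6, dim=-1)
        )
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        x = x + gate2 * self.mlp(h)
        return x
```

In this sketch the conditioning vector would be formed upstream, e.g. by combining the CLAP-derived timbre embedding with the diffusion timestep embedding, while the musical-structure features enter as the token sequence; how TaVIG actually combines these inputs is specified in the paper, not here.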
Submission Number: 50