Structured Polyphonic Music Generation with Diffusion Transformer

Published: 2024, Last Modified: 15 Jan 2026. Venue: GCCE 2024. License: CC BY-SA 4.0
Abstract: We propose a novel approach to structured polyphonic music generation that uses Denoising Diffusion Probabilistic Models (DDPMs) with Transformer blocks replacing U-Net as the backbone architecture. Previous DDPM-based methods for polyphonic music generation have succeeded in creating diverse compositions from various prompts. However, these approaches commonly struggle to produce repeated sections of music, so the generated music lacks structural coherence. This issue stems mainly from the limitations of the widely used U-Net architecture: it lacks the capability to retain memory, a critical requirement for generating repeated melodic phrases. Our method therefore replaces U-Net with a Transformer backbone, forming a Diffusion Transformer (DiT) for polyphonic music generation. Experimental results show that our method generates high-quality repeated musical phrases.
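The abstract does not describe the architecture in detail. As a rough illustration of why an attention-based backbone could help with repetition, here is a minimal sketch of single-head self-attention over a handful of bar-level tokens. It is not the authors' model: the `bars` embeddings are hypothetical toy values, and the projections are identity matrices (Q = K = V), an assumption made to keep the example short. It only shows that a later bar can attend directly to an earlier bar carrying the same phrase, regardless of how far apart they are.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: identity projections, so Q = K = V = tokens.
    Returns (weights, outputs), where weights[i][j] is how much token i
    attends to token j.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights, outputs = [], []
    for q in tokens:
        # Similarity of this query to every token in the sequence.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return weights, outputs


# Hypothetical bar embeddings: bar 3 repeats the phrase from bar 0.
bars = [
    [4.0, 0.0, 0.0],  # phrase A
    [0.0, 4.0, 0.0],  # phrase B
    [0.0, 0.0, 4.0],  # phrase C
    [4.0, 0.0, 0.0],  # phrase A again
]

weights, outputs = self_attention(bars)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

In this toy setting, bar 3 splits almost all of its attention between itself and bar 0 and nearly ignores the unrelated bars in between. A real DiT would use learned projections, multiple heads, positional information, and timestep conditioning, none of which are shown here.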