Content-Style Disentangled Audio Style Transfer via Diffusion Model

Published: 01 Jan 2025 · Last Modified: 13 Nov 2025 · ICME 2025 · CC BY-SA 4.0
Abstract: Deep generative models have advanced the synthesis of high-quality audio signals, shifting the focus from audio fidelity to user-specific customization. Despite significant progress, current models struggle to generate style-consistent audio. Audio style transfer offers a more intuitive way to capture user intent, but it faces challenges in disentangling and interpreting content and style. This paper introduces a novel framework for content-style disentangled audio style transfer. We propose an interpretable, formula-based style distance that effectively disentangles content and style within the language-audio feature space. The proposed QwenAudio-Contrastive Language Audio Pretraining (Qwen-CLAP) content extraction module, together with a CLAP-based style disentanglement loss coordinated with a style reconstruction loss, enables interpretable disentanglement and stylization. Comprehensive experiments on our new dataset, BBCreatures, demonstrate superior stylization quality that preserves both fine style details and the original content.
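The abstract does not spell out the style distance or how the two losses are coordinated. A minimal sketch of one plausible reading, assuming the distance is cosine-based over shared CLAP-style embeddings and that the disentanglement term pushes the output away from the source style while the reconstruction term pulls it toward the target style (all names and the margin/weighting scheme here are illustrative, not the paper's actual formulation):

    import torch
    import torch.nn.functional as F

    def style_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Cosine-based distance between (batch, dim) embeddings in a
        # shared language-audio space; 0 means identical direction, max 2.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        return 1.0 - (a * b).sum(dim=-1)

    def style_transfer_loss(out_emb: torch.Tensor,
                            target_style_emb: torch.Tensor,
                            source_style_emb: torch.Tensor,
                            lambda_dis: float = 1.0,
                            margin: float = 0.5) -> torch.Tensor:
        # Style reconstruction term: pull the stylized output toward the
        # reference style in embedding space.
        recon = style_distance(out_emb, target_style_emb).mean()
        # Disentanglement term (hypothetical hinge): keep the output at
        # least `margin` away from the source recording's own style.
        dis = F.relu(margin - style_distance(out_emb, source_style_emb)).mean()
        return recon + lambda_dis * dis

In such a setup, all three embeddings would come from the same frozen CLAP-style encoder applied to the stylized output, the style reference, and the source audio, so the distance remains interpretable as a similarity in the language-audio feature space.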