Guiding Audio Editing with Audio Language Model

Published: 24 Sept 2025, Last Modified: 07 Nov 2025. NeurIPS 2025 Workshop GenProCC. License: CC BY 4.0
Track: Regular paper
Keywords: Audio editing, Diffusion model, Audio Language Model
Abstract: Audio editing is increasingly important in immersive applications such as VR/AR, virtual conferencing, and sound design. While diffusion-based models have enabled language-driven audio editing, existing methods rely on predefined instruction formats and are limited to mono-channel audio. In this work, we introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capabilities of Audio Language Models (ALMs) with the generative power of latent diffusion. Given a high-level prompt, SmartDJ decomposes it into a sequence of atomic editing steps, which are executed sequentially by a conditional diffusion model trained to manipulate stereo audio. We also develop a scalable data synthesis pipeline that generates training samples consisting of a high-level instruction, a sequence of atomic edits, and the corresponding audio at each step of the editing process. Experiments show that SmartDJ outperforms prior methods in perceptual quality, spatial coherence, and alignment with complex user instructions.
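The abstract describes a two-stage pipeline: an Audio Language Model decomposes a high-level prompt into atomic editing steps, and a conditional diffusion model executes those steps in sequence on the stereo audio. A minimal sketch of that control flow, with stand-in functions (the names `decompose_prompt`, `apply_edits`, and the toy plan are illustrative assumptions, not from the paper):

```python
# Illustrative sketch of a SmartDJ-style pipeline: an ALM decomposes a
# high-level prompt into atomic edits, which a diffusion-based editor
# applies one at a time. All names and the toy plan are hypothetical.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicEdit:
    """One low-level editing instruction, e.g. 'pan the car horn left'."""
    instruction: str


def decompose_prompt(prompt: str) -> List[AtomicEdit]:
    """Stand-in for the ALM: map a high-level prompt to atomic steps.

    A real system would query the language model; here a fixed toy
    mapping illustrates the decomposition.
    """
    toy_plans = {
        "make the scene feel like a busy street": [
            AtomicEdit("add traffic noise in the background"),
            AtomicEdit("pan a car horn from left to right"),
            AtomicEdit("raise overall loudness slightly"),
        ],
    }
    # Unknown prompts fall through as a single edit.
    return toy_plans.get(prompt, [AtomicEdit(prompt)])


def apply_edits(audio: str, edits: List[AtomicEdit],
                editor: Callable[[str, str], str]) -> str:
    """Apply each atomic edit in order.

    `editor` stands in for the conditional diffusion model; `audio` is
    a string tag here purely for illustration.
    """
    for edit in edits:
        audio = editor(audio, edit.instruction)
    return audio


if __name__ == "__main__":
    # Toy 'diffusion editor' that just records the edit chain.
    editor = lambda audio, instr: f"{audio} -> [{instr}]"
    plan = decompose_prompt("make the scene feel like a busy street")
    print(apply_edits("stereo_input.wav", plan, editor))
```

The sequential structure matters: each atomic edit conditions on the output of the previous one, which is also how the paper's data synthesis pipeline pairs each step with intermediate audio.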
Submission Number: 3