Keywords: audio generation, variational auto-encoder, representation learning, self-supervised learning
TL;DR: Channel span masking imposes mel-like spectral bias on high-compression VAE latents by acting as a low-pass window over channels, restoring power-law structure and delivering up to 4× faster Diffusion Transformer convergence.
Abstract: Latent representations critically shape diffusion-based audio generation. We observe that Mel spectrograms exhibit an approximate power-law spectrum that aligns with diffusion’s coarse-to-fine denoising, whereas waveform variational autoencoder (VAE) latents exhibit nearly uniform intensity across the channel axis. We introduce channel-span masking, which in expectation behaves like a rectangular window across channels and thus a low-pass filter in the channel-frequency domain, increasing channel locality. The induced locality steepens latent spectral slopes toward a power-law distribution and yields up to 4× faster convergence of Diffusion Transformer (DiT) training on audio generation tasks, while maintaining reconstruction fidelity and compression. Experimental results show that the model performs comparably to, or better than, competitive baselines under the same conditions. Our code and checkpoint are available at \url{https://anonymous.4open.science/r/lafa-F2A2}.
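The abstract does not give implementation details, but the masking it describes can be sketched. The snippet below is a minimal, hypothetical illustration (the helper name `channel_span_mask`, the span length, and the latent shape are all assumptions, not the authors' code): it zeros a random contiguous span of latent channels, which over many random draws averages to a (complement of a) rectangular window across the channel axis — the window whose channel-frequency response is the low-pass filter the abstract refers to.

```python
import numpy as np

def channel_span_mask(latent, span_len, rng):
    """Zero out one random contiguous span of channels.

    latent:   (C, T) array of VAE latent channels over time.
    span_len: number of consecutive channels to mask (assumed hyperparameter).
    Returns a copy of `latent` with the chosen span set to zero.
    """
    num_channels = latent.shape[0]
    start = rng.integers(0, num_channels - span_len + 1)
    masked = latent.copy()
    masked[start:start + span_len] = 0.0
    return masked

rng = np.random.default_rng(0)
latent = rng.standard_normal((64, 100))  # toy latent: 64 channels, 100 frames
masked = channel_span_mask(latent, span_len=16, rng=rng)

# Averaging the binary keep-mask over many random spans approximates a
# window shape across channels; its transform is sinc-like, i.e. low-pass.
keep = np.mean(
    [np.any(channel_span_mask(np.ones((64, 1)), 16, rng) != 0, axis=1)
     for _ in range(2000)],
    axis=0,
)
```

This is only a sketch of the masking operator itself; in the paper it presumably acts as a training-time regularizer on the VAE latents rather than a post-hoc transform.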
Primary Area: generative models
Submission Number: 25414