Keywords: audio generation, variational auto-encoder, representation learning, self-supervised learning
TL;DR: Channel span masking imposes mel-like spectral bias on high-compression VAE latents by acting as a low-pass window over channels, restoring power-law structure and delivering up to 4× faster Diffusion Transformer convergence.
Abstract: Latent representations critically shape diffusion-based audio generation. We observe that Mel spectrograms exhibit an approximate power-law spectrum that aligns with diffusion’s coarse-to-fine denoising, whereas waveform variational autoencoder (VAE) latents exhibit nearly uniform intensity across the channel axis. We introduce channel-span masking, which in expectation behaves like a rectangular window across channels and thus a low-pass filter in the channel-frequency domain, increasing channel locality. The induced locality steepens latent spectral slopes toward a power-law distribution and yields up to 4× faster convergence of Diffusion Transformer (DiT) training on audio generation tasks, while maintaining reconstruction fidelity and compression. Experimental results show that the model performs comparably to, or better than, competitive baselines under the same conditions. Our code and checkpoint are available at \url{https://anonymous.4open.science/r/lafa-F2A2}.
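The abstract does not give implementation details, but the masking it describes can be sketched. The snippet below is a minimal, hypothetical illustration (the helper name `channel_span_mask`, the span length, and the latent shape are all assumptions, not the authors' code): it zeros a random contiguous span of latent channels, which over many random draws averages to a (complement of a) rectangular window across the channel axis — the window whose channel-frequency response is the low-pass filter the abstract refers to.

```python
import numpy as np

def channel_span_mask(latent, span_len, rng):
    """Zero out one random contiguous span of channels.

    latent:   (C, T) array of VAE latent channels over time.
    span_len: number of consecutive channels to mask (assumed hyperparameter).
    Returns a copy of `latent` with the chosen span set to zero.
    """
    num_channels = latent.shape[0]
    start = rng.integers(0, num_channels - span_len + 1)
    masked = latent.copy()
    masked[start:start + span_len] = 0.0
    return masked

rng = np.random.default_rng(0)
latent = rng.standard_normal((64, 100))  # toy latent: 64 channels, 100 frames
masked = channel_span_mask(latent, span_len=16, rng=rng)

# Averaging the binary keep-mask over many random spans approximates a
# window shape across channels; its transform is sinc-like, i.e. low-pass.
keep = np.mean(
    [np.any(channel_span_mask(np.ones((64, 1)), 16, rng) != 0, axis=1)
     for _ in range(2000)],
    axis=0,
)
```

This is only a sketch of the masking operator itself; in the paper it presumably acts as a training-time regularizer on the VAE latents rather than a post-hoc transform.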
Primary Area: generative models
Submission Number: 25414