Next-Scale Autoregression on Spectrograms for Sound Generation

Published: 02 Mar 2026, Last Modified: 10 Mar 2026ICLR 2026 Workshop MM Intelligence PosterEveryoneRevisionsCC BY 4.0
Track: tiny paper (up to 4 pages)
Keywords: Sound Generation, Autoregressive Models, Spectral Representation, Next-Scale Prediction, Efficient Computation
Abstract: Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), which, to the best of our knowledge, is the first adaptation of next-scale autoregressive modeling to the spectrogram domain. MARS treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping strategy that reduces spatial resolution without information loss. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large-scale dataset demonstrate that MARS performs comparably or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity sound generation.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 31
Loading