Linear RNNs for autoregressive generation of long music samples

Published: 23 Sept 2025 · Last Modified: 08 Nov 2025 · AI4Music · CC BY 4.0
Keywords: recurrent neural network, seq2seq, music, ssm, rnn, linear rnn
TL;DR: We demonstrate that linear RNNs with pooling can generate coherent one-minute music samples and achieve state-of-the-art log-likelihoods on small-scale datasets.
Abstract: Directly learning to generate audio waveforms in an autoregressive manner is a challenging task, due to the length of the raw sequences and the existence of important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolution and self-attention, have had only limited success on this task. However, recent work has shown that deep state space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.
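To make the core idea concrete, below is a minimal sketch of a diagonal linear RNN (state-space) layer computed with a parallel associative scan in JAX. This is an illustration of the general technique the abstract refers to, not the paper's HarmonicRNN implementation; all names and shapes (`lam`, `B`, `C`, `d_model`, `d_state`) are assumptions for the example.

```python
# Sketch of a diagonal linear RNN layer: h_t = lam * h_{t-1} + B @ x_t,
# y_t = Re(C @ h_t), evaluated with jax.lax.associative_scan so the
# recurrence runs in O(log n) parallel depth instead of O(n) steps.
# NOTE: illustrative only; not the paper's HarmonicRNN architecture.
import jax
import jax.numpy as jnp

def linear_rnn(lam, B, C, x):
    """lam: (d_state,) complex decays with |lam| < 1 for stability.
    B: (d_state, d_model), C: (d_model, d_state), x: (seq_len, d_model)."""
    u = x @ B.T                          # per-step inputs, (seq_len, d_state)
    a = jnp.broadcast_to(lam, u.shape)   # per-step decay factors

    def combine(left, right):
        # Compose two affine maps h -> a*h + b; this operation is
        # associative, which is what the parallel scan requires.
        a1, b1 = left
        a2, b2 = right
        return a2 * a1, a2 * b1 + b2

    _, h = jax.lax.associative_scan(combine, (a, u))
    return jnp.real(h @ C.T)             # (seq_len, d_model)

key = jax.random.PRNGKey(0)
d_model, d_state, seq_len = 8, 16, 1024
lam = jnp.exp(-0.01 + 1j * jax.random.uniform(key, (d_state,)))
B = jax.random.normal(key, (d_state, d_model)) + 0j
C = jax.random.normal(key, (d_model, d_state)) + 0j
x = jax.random.normal(key, (seq_len, d_model))
print(linear_rnn(lam, B, C, x).shape)    # (1024, 8)
```

Because the recurrence is linear in the hidden state, the scan can also be split across devices (each shard scans its chunk, then chunk boundaries are stitched with the same `combine` operator), which is the property that context-parallel training over very long sequences exploits.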
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
(Optional) Supplementary Material: zip
Submission Number: 41