Keywords: music generation, flow matching, generative models
TL;DR: We train a real-time audio model end-to-end and set it up to support freestyle control in latent space (look Ma, no prompt!)
Abstract: Live neural audio generation has the potential to enable new modes of performance and experimentation in music and sound art. The latest generation of proprietary audio models boasts high perceptual quality, text and sample prompting, and in some cases live generation. However, these services are fundamentally limited. Remote hosting obstructs seamless interaction and customization, while prompt-based control pigeonholes outputs into existing descriptors and sounds. In this work, we describe Autoencoding Sequentially Unrolled Amortized Flow (ASUrA-Flow), a generative audio model capable of live output on local hardware. Inspired by the emergent latent codes of generative adversarial networks and variational autoencoders, ASUrA-Flow is designed to be played in real time by directly modulating its latent control vector without text or audio prompts. We train our architecture end-to-end on raw audio using amortized flow matching, a novel distribution-matching objective that provides stable training and efficient, high-fidelity output directly in signal space.
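The abstract does not define the amortized flow matching objective itself. For orientation only, a minimal sketch of a standard conditional flow matching training step with a latent control vector z might look like the following; the network, dimensions, and names here are all hypothetical and are not the paper's actual method:

```python
import torch
import torch.nn as nn

# Hypothetical velocity-field network conditioned on a latent control
# vector z. The real ASUrA-Flow architecture is not specified in the
# abstract; this stand-in is an ordinary MLP for illustration.
class VelocityField(nn.Module):
    def __init__(self, dim=256, latent_dim=64, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + latent_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, z, t):
        # Concatenate the noisy sample, latent code, and time step.
        return self.net(torch.cat([x_t, z, t], dim=-1))

def flow_matching_loss(model, x1, z):
    """One standard conditional flow matching step (rectified-flow form),
    not the paper's amortized variant."""
    x0 = torch.randn_like(x1)            # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)       # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                   # constant target velocity
    v_pred = model(x_t, z, t)
    return ((v_pred - v_target) ** 2).mean()

# Example usage on dummy data:
model = VelocityField()
x1 = torch.randn(8, 256)   # batch of data samples (e.g., audio latents)
z = torch.randn(8, 64)     # latent control vectors
loss = flow_matching_loss(model, x1, z)
loss.backward()
```

Under this sketch, live generation would amount to integrating the learned velocity field with an ODE solver while a performer modulates z in real time.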
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
(Optional) Supplementary Material: zip
Submission Number: 96