Keywords: foley generation, controllable generation, fast generation, interactive sound generation, real-time audio
Abstract: Despite the growth of Text-to-Audio (TTA) models for creative applications like sound design and live jamming, existing systems, particularly open-source ones, cannot offer flexible, fine-grained control (such as vocal “sketches”) while maintaining inference speeds fast enough for real-time interaction. We address this unnecessary tradeoff between speed and control with FlashFoley, the first open-source, accelerated sketch2audio model. FlashFoley extends the Sketch2Sound framework, in which TTA models are finetuned with pitch, volume, and brightness controls through simple linear adaptation, by adding adversarial post-training, allowing the model to generate 11 s samples from audio sketches in 75 ms. We combine this with a novel zero-shot chunked streaming algorithm, enabling real-time interactive generation while preserving fast, high-quality offline sampling. Audio examples can be found at https://anonaudiogen.github.io/web.
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 29