ZeroSep: Separate Anything in Audio with Zero Training

Anonymous Submission to NeurIPS 2025
Paper ID: 13680

Overview

ZeroSep is a training-free approach to audio source separation. Users specify the sound source they want to separate with a natural language prompt, and the method leverages pretrained audio diffusion models to perform the separation without specialized training data or fine-tuning.

Interactive Demo

This interactive demo showcases our interface for real-world audio source separation. Users can isolate any sound source through a natural language prompt, and ZeroSep performs the separation zero-shot, with no training required. We will open-source all code and demo implementations upon publication.

ZeroSep interface demonstration: separating audio sources using natural language prompts

Comparison with State-of-the-Art Methods

Below are comparison results on the MUSIC and AVE datasets. We compare ZeroSep against training-based methods (LASS-Net, AudioSep, FlowSep) and a training-free baseline (AudioEdit). The natural language prompt used to specify the target source is shown at the beginning of each row.

Results
Prompt | Mixture | Ground Truth | LASS-Net (Training-based) | AudioSep (Training-based) | FlowSep (Training-based) | AudioEdit (Training-free) | ZeroSep (Ours, Training-free)
"Extract the tuba"
"Extract the cello"
"Extract the accordion"
"Extract the flute"
"Extract the xylophone"
"Extract the chainsaw"
"Extract the frying food"
"Extract the Banjo"
"Extract the Speech"
"Extract the Speech"
"Extract the Church Bell"