Keywords: Audio Source Separation, Universal Sound Separation, Text-guided Sound Separation, Neural Audio Codecs, DAC
TL;DR: We propose CodecSep, a compute-efficient, text-guided universal sound separation model that operates in neural audio codec space and outperforms prior methods such as AudioSep at roughly 25× lower (architecture-only) compute.
Abstract: Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models such as AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end, approximately 54× less compute (25× architecture-only) than spectrogram-domain separators such as AudioSep, while remaining fully bitstream-compatible.
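To make the conditioning mechanism concrete, the following minimal PyTorch sketch illustrates the general pattern the abstract describes: a CLAP text embedding is projected to per-channel scale and shift parameters (FiLM) that modulate a Transformer masker operating on DAC latents. All module names, dimensions, and the sigmoid masking head are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of FiLM-conditioned masking over codec latents. The latent
# dimension, text-embedding dimension, and layer counts below are assumed
# placeholders, not values taken from the paper.
import torch
import torch.nn as nn

class FiLMMasker(nn.Module):
    def __init__(self, latent_dim=1024, text_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        # Map the CLAP text embedding to per-channel scale (gamma) and shift (beta).
        self.film = nn.Linear(text_dim, 2 * latent_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_mask = nn.Linear(latent_dim, latent_dim)

    def forward(self, z, text_emb):
        # z: (batch, time, latent_dim) continuous DAC latents of the mixture.
        # text_emb: (batch, text_dim) CLAP embedding of the separation prompt.
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * z + beta.unsqueeze(1)  # FiLM: affine modulation per channel
        h = self.transformer(h)
        mask = torch.sigmoid(self.to_mask(h))           # soft mask over latent channels
        return mask * z                                  # masked latents, fed to the DAC decoder

# Hypothetical usage: separated = FiLMMasker()(dac_latents, clap_text_embedding)
```

Because the mask is applied in codec latent space rather than on a spectrogram, the separator's cost scales with the codec's low frame rate, which is the source of the compute savings the abstract reports.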
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17923