Abstract: Text-guided sound separation enables flexible audio editing and assistive applications, but existing open-domain systems such as AudioSep remain too compute-intensive for low-latency edge or codec-mediated deployment. Neural audio codec (NAC)-based separators such as CodecFormer and SDCodec are more efficient, but they are largely restricted to fixed-class or fixed-stem separation.
We introduce \textbf{CodecSep}, a \emph{text-guided universal sound separation} framework that operates directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight Transformer \emph{masker} conditioned by CLAP-derived FiLM parameters, enabling open-vocabulary source extraction while preserving the efficiency advantages of codec-native representations. To our knowledge, this is the first prompt-driven universal sound separation system built directly on NAC latents.
Across \textbf{dnr-v2} and five additional open-domain benchmarks under matched training and prompting protocols, CodecSep consistently improves over AudioSep in separation fidelity (\textbf{SI\mbox{-}SDR}) while remaining competitive in perceptual quality (\textbf{ViSQOL}), and also shows gains in human \textbf{MOS--LQS}. Further analyses show that finer-grained semantic supervision improves separation more consistently than coarse prompting, and that \emph{explicit masking} is more effective than decoder-style latent generation for codec-domain source separation. Qualitative and diagnostic analyses further support the central design premise: modern NAC latents preserve meaningful \emph{source-dependent structure}, and the learned masks exploit this structure primarily through \emph{channel-wise modulation}, indicating that source extraction can be performed through masking alone without explicit latent generation.
From a systems perspective, CodecSep also provides a concrete \emph{deployment path} for codec-mediated audio processing. In deployment-typical \emph{code-stream} settings, where the edge device transmits audio as NAC codes produced by the same codec backbone used by the separator, the server can map the received codes to codec embeddings via codebook lookup and perform separation directly in codec space, avoiding a separate decode--separate--re-encode cycle. In this regime, CodecSep requires only \textbf{1.35~GMACs} end-to-end, about $\mathbf{54\times}$ less compute than AudioSep in the same codec-mediated pipeline (and about $\mathbf{25\times}$ lower separator-only compute), while also substantially reducing latency and memory footprint and remaining fully compatible with \emph{codes-in / codes-out} operation. More broadly, this codes-in / codes-out formulation offers a concrete blueprint for \emph{codec-native downstream audio processing}, suggesting that tasks such as enhancement, denoising, dereverberation, and prompt-guided audio editing can operate directly on NAC representations rather than repeatedly decoding to waveform and re-encoding after each processing stage.
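The codes-in / codes-out operation described above can be sketched as follows. This is a toy illustration with placeholder codebook, code stream, and mask values; a real system would use the frozen DAC codebooks and the CLAP-conditioned Transformer masker rather than a hand-set channel gain.

```python
# Minimal sketch (hypothetical names) of the code-stream pipeline: the
# server receives NAC codes, maps them to codec embeddings via codebook
# lookup, and applies a channel-wise mask directly in codec space,
# without any waveform decode/re-encode in between.

from typing import List

def lookup(codes: List[int], codebook: List[List[float]]) -> List[List[float]]:
    """Map a stream of codebook indices to embedding vectors (frames of Z)."""
    return [codebook[c] for c in codes]

def channelwise_mask(Z: List[List[float]], mask: List[float]) -> List[List[float]]:
    """Apply one gain per latent channel at every frame."""
    return [[m * z for m, z in zip(mask, frame)] for frame in Z]

# Toy codebook with d = 3 latent channels and 4 entries.
codebook = [[0.0, 0.0, 0.0],
            [1.0, 2.0, 3.0],
            [2.0, 0.0, 1.0],
            [0.5, 0.5, 0.5]]

codes = [1, 3, 2]                              # received code stream (T = 3 frames)
Z = lookup(codes, codebook)                    # codec-space latents, T x d
masked = channelwise_mask(Z, [1.0, 0.0, 1.0])  # suppress latent channel 1
print(masked)  # [[1.0, 0.0, 3.0], [0.5, 0.0, 0.5], [2.0, 0.0, 1.0]]
```

The masked latents can then be re-quantized back to codes for transmission, keeping the entire round trip inside the codec domain.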
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The main changes are summarized below.
1. **Method reorganization and clearer notation.** Section 3 has been reorganized into task formulation, model architecture, training objective, and deployment discussion. We also made the pipeline more explicit by specifying the DAC latent \(Z \in \mathbb{R}^{d\times T}\), the pointwise convolutional interface between the NAC latents and the masker, the Transformer width, the projection layers, the FiLM injection points, the query network, and the mask head.
2. **Stronger justification of FiLM design.** We expanded the discussion of why FiLM is applied inside the masker and why we use Post-LN FiLM instead of AdaLN. Since CodecSep performs masking-based source selection rather than broad conditional generation, we argue that conditioning should act directly on latent features that must be preserved or suppressed.
3. **Clarified FiLM parameterization and stability.** We now explicitly state that CodecSep uses additive-only FiLM with a fixed scale \((\gamma^l=\mathbf{1})\), Xavier-uniform weight initialization, and zero bias, so that the conditioned masker begins as an identity-like, unmodulated masker.
4. **Experiments reorganized around explicit questions.** Section 4 has been restructured so that each experimental block begins with a question-style heading and directly answers a specific hypothesis.
5. **Added stronger statistical analysis.** We added paired statistical analysis for the additional benchmark results, including mean paired gains, 95\% confidence intervals, paired \(t\)-tests, and Wilcoxon signed-rank tests. We also added significance analysis for the MOS-LQS subjective evaluation, showing statistically significant gains overall and for music, speech, and SFX individually.
6. **Clearer framing of comparison fairness.** We revised the paper to make clear that the main comparison is a codec-mediated / compressed-input deployment comparison. We also added a new completeness table reporting non-codec baselines in their native raw-audio setting, and explicitly clarify that our claim is deployment-regime-specific rather than universal.
7. **Direct qualitative evidence for structured codec latent space.** We added a new qualitative analysis section and figure showing the mixture latent, reference source latents, estimated source-conditioned masks, and a residual analysis. These results support the claim that frozen codec latents preserve source-dependent structure and that CodecSep operates mainly through channel-wise latent reweighting.
8. **Oracle codec reconstruction analysis.** We added a new oracle analysis comparing oracle codec reconstruction against separated outputs to distinguish codec-induced distortion from separation-specific error.
9. **Refined efficiency claims.** The reported \(\sim54\times\) efficiency gain was always intended specifically for the **code-stream deployment regime**, not as a universal end-to-end raw-audio claim. In the revised manuscript, we make this scope more explicit and distinguish more carefully between **systems-level deployment efficiency** and **architecture-only comparisons**.
10. **Bandwidth-scaling and sampling-rate discussion.** To address sampling-rate mismatch concerns, we added a dedicated **bandwidth-scaling study** over multiple codec backbones and clarified why AudioSep is kept in its standard 32 kHz form while CodecSep uses a 16 kHz DAC backbone as its primary low-MAC operating point.
11. **Clarified subjective evaluation protocol.** We explain that the MOS comparison used the official pretrained AudioSep checkpoint, since it was stronger than our retrained version on dnr-v2 and therefore provides a more conservative baseline.
12. **Clarified reporting choices and limitations.** We now explain why **dnr-v2** is reported with per-source scores, whereas the open-domain benchmarks are reported using dataset-level averages. We also moved key limitations into the main paper and expanded the discussion of failure cases, including training scale / prompt diversity limits, lack of temporal-relational prompt evaluation, SFX perceptual gaps, and residual overlap due to channel-wise masking.
13. **Broader impact and softened claims.** We added a Broader Impact section and revised claim wording throughout to avoid overstatement, especially regarding on-device claims.
14. **Expanded future work.** We expanded the discussion of future directions into a dedicated section.
15. **Revised prompt and appendix analyses.** We softened overstrong claims in the prompt-granularity discussion, clarifying that **finer-grained semantic supervision improves separation more consistently**. We also reorganized the appendix around explicit hypotheses, including a dedicated **cross-domain stress-test / failure-case analysis on AudioCaps**, and clarified the reconstruction / masker-ablation results, showing that FiLM-only encoder conditioning supports source-consistent reconstruction, while an explicit masker is required for source selection from mixtures.
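As an illustration of the additive-only FiLM parameterization summarized in change 3, here is a minimal sketch with hypothetical names; the linear layer that predicts the shift from the CLAP text embedding is omitted.

```python
# Additive-only FiLM: the scale gamma is fixed to 1, so only a shift beta
# (predicted from the text embedding in the full model) modulates the
# masker features. With beta initialized to zero, the layer is an identity,
# matching the identity-like initialization described in change 3.

from typing import List

def additive_film(h: List[float], beta: List[float]) -> List[float]:
    """FiLM with gamma fixed to 1: h' = 1 * h + beta, channel-wise."""
    return [x + b for x, b in zip(h, beta)]

h = [0.2, -1.0, 3.5]
print(additive_film(h, [0.0, 0.0, 0.0]))   # identity at init: [0.2, -1.0, 3.5]
print(additive_film(h, [1.0, 0.5, -0.5]))  # conditioned shift: [1.2, -0.5, 3.0]
```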
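The paired analysis summarized in change 5 can be sketched with standard-library code. The scores below are made up for illustration; the actual analysis also reports 95% confidence intervals, p-values, and Wilcoxon signed-rank tests (e.g. via `scipy.stats`).

```python
# Paired comparison of two systems on the same test items: compute the
# mean paired gain and the paired t statistic on per-item differences.

import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Return (mean difference, t statistic) for paired samples a vs b."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    md = mean(d)
    t = md / (stdev(d) / math.sqrt(n))  # t with n - 1 degrees of freedom
    return md, t

# Hypothetical per-item SI-SDR scores for two separators.
codecsep = [9.1, 8.4, 10.2, 7.9, 9.6]
audiosep = [8.0, 7.9, 9.1, 7.5, 8.8]
md, t = paired_t(codecsep, audiosep)
print(round(md, 2), round(t, 2))  # 0.78 5.33
```

Because every test item is scored under both systems, the paired test controls for per-item difficulty and is more sensitive than an unpaired comparison of dataset means.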
Assigned Action Editor: ~Yossi_Adi1
Submission Number: 7175