# ICLR 2026 Supplementary Audio Examples

This repository contains supplementary audio outputs for our ICLR 2026 submission:  
**_"CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents"_**

Our proposed model, **CodecSep**, performs text-guided source separation using neural audio codec representations. This supplementary material demonstrates the model's qualitative performance across a range of prompts and mixtures from the **[Divide and Remaster (dnr-v2)](https://zenodo.org/records/6949108)** test set.

For comparison, we include outputs from the publicly available **AudioSep** model. Although we retrained AudioSep on the dnr-v2 dataset for controlled benchmarking in the main paper, the **pretrained AudioSep model achieved higher SI-SDR and ViSQOL** on the dnr-v2 test set than the retrained version. Accordingly, all AudioSep outputs presented here are generated using the **official pretrained model**.


---

## 📝 Notes

- These examples are generated using our **CodecSep** model trained on the [Divide and Remaster (dnr-v2)](https://zenodo.org/records/6949108) dataset.
- All clips correspond to **model inferences on the dnr-v2 test set**.
- We present separated outputs for **20 randomly selected test mixtures**, the same mixtures used in our subjective evaluation (MOS-LQS).
- We use the fixed prompts `"speech"` and `"music"` to extract the speech and music sources, respectively.
- The prompt used to condition **SFX separation** is stored in a `.txt` file named `<clip_id>_sfx-prompt.txt`.
- **CodecSep outputs** are in **16-bit mono WAV format**, sampled at **16 kHz**, with a fixed duration of **10 seconds**.
- **AudioSep outputs** are in **16-bit mono WAV format**, sampled at **32 kHz**, with a fixed duration of **10 seconds**.   
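
The formats stated above can be checked with a short script. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_info` and `matches_expected`, and the 0.05-second duration tolerance, are illustrative assumptions and are not part of this repository.

```python
import wave

# Expected formats, taken from the notes above (16-bit = 2 bytes per sample).
EXPECTED = {
    "codecsep": {"channels": 1, "sample_width_bytes": 2, "sample_rate": 16000},
    "audiosep": {"channels": 1, "sample_width_bytes": 2, "sample_rate": 32000},
}


def wav_info(path):
    """Read channel count, sample width, sample rate, and duration from a WAV header."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            "duration_s": wf.getnframes() / rate,
        }


def matches_expected(path, model, duration_s=10.0, tol_s=0.05):
    """Return True if the file matches the stated format for `model`.

    `tol_s` is an assumed tolerance on the 10-second duration, not a documented value.
    """
    info = wav_info(path)
    format_ok = all(info[key] == value for key, value in EXPECTED[model].items())
    return format_ok and abs(info["duration_s"] - duration_s) <= tol_s
```

Because the two models write at different sample rates, any sample-level comparison between their outputs would first require resampling to a common rate.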




---

## 📁 Directory Structure

Each folder (e.g., `720/`, `1230/`) corresponds to a single test example and contains:




```
<clip_id>/
├── reference/
│   ├── mix-ref-<clip_id>.wav         # Input mixture
│   ├── speech-ref-<clip_id>.wav      # Ground truth speech
│   ├── music-ref-<clip_id>.wav       # Ground truth music
│   ├── sfx-ref-<clip_id>.wav         # Ground truth SFX
│
├── estimate/
│   ├── speech_sep-est-<clip_id>.wav  # Predicted speech source (CodecSep)
│   ├── music_sep-est-<clip_id>.wav   # Predicted music source (CodecSep)
│   ├── sfx_sep-est-<clip_id>.wav     # Predicted SFX source (CodecSep)
│
├── estimate-audiosep/
│   ├── speech_sep-est-audiosep-<clip_id>.wav  # Predicted speech source (AudioSep)
│   ├── music_sep-est-audiosep-<clip_id>.wav   # Predicted music source (AudioSep)
│   ├── sfx_sep-est-audiosep-<clip_id>.wav     # Predicted SFX source (AudioSep)
│
└── <clip_id>_sfx-prompt.txt          # Prompt text used during SFX separation
```
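
For scripted listening or evaluation, the naming scheme above can be turned into paths programmatically. The sketch below assumes the layout exactly as shown. The function names `clip_paths` and `prompt_for`, the dictionary keys, and the `root` argument are hypothetical helpers for illustration, not code shipped with this repository.

```python
from pathlib import Path

SOURCES = ("speech", "music", "sfx")


def clip_paths(root, clip_id):
    """Build the expected file paths for one clip folder, following the tree above."""
    clip_dir = Path(root) / str(clip_id)
    paths = {
        "mix": clip_dir / "reference" / f"mix-ref-{clip_id}.wav",
        "sfx_prompt": clip_dir / f"{clip_id}_sfx-prompt.txt",
    }
    for src in SOURCES:
        paths[f"{src}_ref"] = clip_dir / "reference" / f"{src}-ref-{clip_id}.wav"
        paths[f"{src}_codecsep"] = clip_dir / "estimate" / f"{src}_sep-est-{clip_id}.wav"
        paths[f"{src}_audiosep"] = (
            clip_dir / "estimate-audiosep" / f"{src}_sep-est-audiosep-{clip_id}.wav"
        )
    return paths


def prompt_for(paths, src):
    """Return the prompt for a source: fixed for speech/music, read from file for SFX."""
    if src in ("speech", "music"):
        return src
    return paths["sfx_prompt"].read_text(encoding="utf-8").strip()
```

For example, `clip_paths(".", 720)["sfx_audiosep"]` resolves to `720/estimate-audiosep/sfx_sep-est-audiosep-720.wav`.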

---

## 🔊 Listening Guide

Each example includes three source types: `speech`, `music`, and `sfx`.


For each `<clip_id>`:
- The `reference/` folder contains the original mixture and ground-truth source components.
- The `estimate/` folder contains CodecSep predictions for each source.
- The `estimate-audiosep/` folder contains AudioSep predictions for the same sources.
- The `<clip_id>_sfx-prompt.txt` file in each clip folder gives the exact prompt used for SFX separation.

This layout enables direct comparison of **CodecSep** and **AudioSep** for the same mixture and prompt context.

We also provide a subjective assessment comparing the outputs of **CodecSep** and **AudioSep**. It is intended to assist reviewers by highlighting typical separation characteristics, including leakage, source clarity, and overall perceptual quality across music, speech, and SFX. The results are documented in `subjective_assessment.md`.

---

For questions or clarifications, feel free to contact us.







