# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

## Install
The requirements can be installed with:
```shell
pip install -r requirements.txt
```
Additional requirements for Stable Audio Open and ZETA (DDPM Inversion) need to be installed in their respective repositories (see Data Generation).


## Data Generation

### Prompt Generation
The script `generate_prompts.py` generates the prompts. It expects a `.jsonl` file as input, with one JSON object per line in the following form:
```json lines
{"caption":  "Audio Caption", "metadata": {}}
```
If you download audio clips from captioning datasets (e.g., to use DDPM inversion for paired sample generation), the `metadata` field can be used to match each caption to its audio filename.
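
As a minimal sketch of preparing such an input file, the following writes captions to a `.jsonl` file and reads them back with a basic format check. The file name `prompts_input.jsonl`, the `filename` key inside `metadata`, the helper names, and the example captions are illustrative assumptions, not names required by `generate_prompts.py`.

```python
import json
from pathlib import Path


def write_caption_file(records, path):
    """Write one JSON object per line in the {"caption", "metadata"} form."""
    with Path(path).open("w", encoding="utf-8") as f:
        for caption, metadata in records:
            f.write(json.dumps({"caption": caption, "metadata": metadata}) + "\n")


def read_caption_file(path):
    """Read the file back and check that every line has the expected fields."""
    entries = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            valid = (
                isinstance(entry, dict)
                and isinstance(entry.get("caption"), str)
                and isinstance(entry.get("metadata"), dict)
            )
            if not valid:
                raise ValueError(
                    f"line {line_number}: expected a 'caption' string and a 'metadata' object"
                )
            entries.append(entry)
    return entries


# Hypothetical captions; the "filename" key is an assumed way to link
# a caption to its downloaded audio clip.
records = [
    ("A dog barks while rain falls", {"filename": "clip_0001.wav"}),
    ("A car engine starts and idles", {}),
]
write_caption_file(records, "prompts_input.jsonl")
entries = read_caption_file("prompts_input.jsonl")
```

An empty `metadata` object is kept for captions that have no associated clip, so every line has the same two fields.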


### Paired Sample Generation

#### Prompt-to-Prompt
After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs.
The Prompt-to-Prompt pipeline consists of two parts:
- Candidate Search: Searching for suitable candidates (classifier-free guidance (CFG) scale and seed) for every prompt in the prompt file.
- Sample Generation: Generating the edited audio pairs using the candidates found in the previous step.
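
The candidate search can be pictured as a grid search over CFG values and seeds. The sketch below is a generic illustration, not the logic of `generate_candidates.py`: the function `search_candidates`, the `score_fn` callback, the placeholder `dummy_score`, and the value grids are all hypothetical, and the actual selection criterion is defined by the repository's script.

```python
from itertools import product


def search_candidates(prompts, cfg_scales, seeds, score_fn):
    """Return the best (cfg_scale, seed) pair per prompt under score_fn (higher is better)."""
    best = {}
    for prompt in prompts:
        # Evaluate every (CFG, seed) combination and keep the highest-scoring one.
        best[prompt] = max(
            product(cfg_scales, seeds),
            key=lambda candidate: score_fn(prompt, candidate[0], candidate[1]),
        )
    return best


def dummy_score(prompt, cfg_scale, seed):
    """Placeholder score (assumption); a real criterion would presumably rate generated audio."""
    return -abs(cfg_scale - 6.0) - 0.01 * seed


candidates = search_candidates(
    prompts=["A dog barks"],
    cfg_scales=[3.0, 6.0, 9.0],
    seeds=[0, 1, 2],
    score_fn=dummy_score,
)
```

Passing the scoring function as a callback keeps the search loop independent of how a candidate is judged.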

Use the script `generate_candidates.py` for the candidate search.
The script `generate_samples.py` can be used for the Prompt-to-Prompt sample generation. Make sure to use the mode `p2p`.
We have included the source code of [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools`.
You can install its requirements using:
```shell
cd audio_generation/p2p/stable-audio-tools && pip install .
```
Make sure that the `k_diffusion` package is configured to use the same starting noise for every sample in a batch. To do so, change the ending of the function `sample_dpmpp_3m_sde` in the `k_diffusion/sampling.py` file to:
```python
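if eta:
    # Keep only the first batch element's noise and repeat it so that
    # every sample in the batch receives identical noise.
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)

    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise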
```

#### DDPM Inversion
The script `generate_samples.py` can be used to create samples using DDPM inversion (use the mode `edit`).
We follow the implementation from the paper [Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://github.com/HilaManor/AudioEditingCode/).
Clone the repository and install its requirements using:
```shell
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
pip install -r AudioEditingCode/requirements.txt
```

#### Manual Edits
For generating manual edits, use the script `manual_edits/generate_manual_samples.py`.

## SAO-Instruct
After generating the dataset of audio editing triplets, you can fine-tune Stable Audio Open by following the [official guidelines](https://github.com/Stability-AI/stable-audio-tools).
In the `finetuning` folder, we provide updated training and data loading scripts to enable fine-tuning on audio editing triplets.
Additionally, we provide sample configs for the model and dataset.
After training, you can load the model using `finetuning/sao_instruct.py`. We are working on providing easier inference using Hugging Face.

