# Music transcription

The current version of Jointist has the following functions:
1. Instrument Recognition
1. Multi-Instrument Transcription
1. Music Source Separation

This is the end-to-end (end2end) version of Jointist. For the original version, please refer to the tag v1.0.

<img src="./model_fig.png" width="400">

A demonstration of Jointist is available [here](https://bytedance.feishu.cn/docx/doxcnI1D5cMsinM9VOjdFGliX9f).

## Setup
This code was developed using the Docker image `nvidia/cuda:10.2-devel-ubuntu18.04` and Python 3.8.10.

To set up the environment for Jointist, install the dependencies:
```bash
pip install -r requirements.txt
```

If you get `OSError: sndfile library not found`, install `libsndfile1`:

```bash
apt install libsndfile1
```

<!-- It will download model weights, demo songs, and Anaconda, and then install the required dependencies into a jointist environment.

The model weights and demo songs are located in `weights` and `songs` folder respectively. -->

You need to download the `weights` and `songs` folders and put them under the root of this repository.

`weights` contains all the pre-trained model weights.

`songs` contains the demo audio for transcription and source separation.

The setup script will automatically run inference on the demo songs and write the results to the `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output` folder.

## Transcribing your own songs
### a. Instrument Recognition + Transcription
Once you are inside the `jointist` environment, you can use the following command to transcribe your own songs.
```bash
python pred_jointist.py audio_path=songs audio_ext=mp3 gpus=[0]
```

It first runs an instrument recognition model, and the predicted instruments are then used as conditions for the transcription model.

If you have multiple GPUs, the `gpus` argument controls which GPU to use. For example, to use GPU 2, set `gpus=[2]`.

Set `audio_path` to the location of your audio files. If your audio files are not in `.mp3` format, change the `audio_ext` argument to match the format of your songs. Since we use `torchaudio.load` to load audio files, you can use any audio format supported by `torchaudio.load`.
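
If you want to launch the transcription from another Python script, the command above can be assembled programmatically. The sketch below is only an illustration: `build_pred_command` is a hypothetical helper that is not part of this repository, and it assumes the three overrides shown above (`audio_path`, `audio_ext`, `gpus`) are the only ones you need.

```python
import subprocess
import sys


def build_pred_command(audio_path, audio_ext="mp3", gpu=0, script="pred_jointist.py"):
    """Return the argument list for the prediction command shown above.

    Hypothetical helper (not part of the repository): it only formats the
    `audio_path`, `audio_ext`, and `gpus` overrides documented in this README.
    """
    return [
        sys.executable,
        script,
        f"audio_path={audio_path}",
        f"audio_ext={audio_ext.lstrip('.')}",  # the README passes "mp3", not ".mp3"
        f"gpus=[{gpu}]",
    ]


if __name__ == "__main__":
    cmd = build_pred_command("songs", audio_ext="wav", gpu=2)
    print(" ".join(cmd))
    # Uncomment to actually run the transcription from the repository root:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than a single shell string) keeps the shell from interpreting the square brackets in `gpus=[2]`.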

The output MIDI files will be stored inside the `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output` folder.
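
Because each run writes to a new timestamped folder, it can be handy to locate the most recent one automatically. The following is a minimal sketch, assuming the `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output` layout described above; `latest_midi_output` is a hypothetical helper, and the `*.mid*` pattern is a guess at the MIDI file extension.

```python
from pathlib import Path


def latest_midi_output(outputs_root="outputs"):
    """Return the most recent `MIDI_output` folder, or None if there is none.

    Assumes the `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output` layout; zero-padded
    date and time folder names sort chronologically when compared as strings.
    """
    candidates = sorted(
        p for p in Path(outputs_root).glob("*/*/MIDI_output") if p.is_dir()
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    folder = latest_midi_output()
    if folder is None:
        print("No MIDI_output folder found under ./outputs")
    else:
        # "*.mid*" matches both .mid and .midi (the extension is an assumption).
        for midi_file in sorted(folder.glob("*.mid*")):
            print(midi_file)
```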

Model weights can be changed under `checkpoint` of `End2End/config/jointist_inference.yaml`. `transcription1000.ckpt` is currently the best checkpoint in terms of F1 scores.

### b. Instrument Recognition + Transcription + Source Separation

This assumes that you have set up the environment as described in the [Setup](#setup) section. The following command transcribes and separates music pieces, writing the results to `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output` and `outputs/YYYY-MM-DD/HH-MM-SS/audio_output` respectively.

```bash
python pred_jointist_ss.py audio_path=songs audio_ext=mp3 gpus=[0]
```

Model weights can be changed under `checkpoint` of `End2End/config/jointist_ss_inference.yaml`. `tseparation.ckpt` is the checkpoint with the better transcription F1 score and source separation SDR after training both tasks end2end.
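
For readers unfamiliar with SDR (signal-to-distortion ratio), the sketch below computes the basic textbook form, 10·log10(‖s‖² / ‖s − ŝ‖²), in plain Python. It is only meant to illustrate what the metric measures; `sdr_db` is a hypothetical name, and the evaluation code in this repository may use a different variant.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB for two equal-length sample sequences.

    Basic definition: 10 * log10(||s||^2 / ||s - s_hat||^2).
    `eps` avoids division by zero and log(0) for silent or perfect signals.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


if __name__ == "__main__":
    target = [1.0, 0.0, -1.0, 0.0]
    separated = [0.9, 0.0, -0.9, 0.0]  # 10% amplitude error
    print(f"SDR: {sdr_db(target, separated):.2f} dB")
```

A higher SDR means the separated source is closer to the reference signal.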

Implementation details for Jointist are available [here](./jointist_explanation.md).


## pkl files to piano rolls

After transcription, you can find all the `pkl` and `midi` files inside `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output`. `pkl2pianoroll.py` is the script that converts these `pkl` files into `h5` piano rolls.

In `pkl2pianoroll.yaml`, set `audio_h5_path` to the location of your `h5` audio files. The audio files are required to calculate the length of the piano rolls. You also need to set `pkl_path` to the location of your `pkl` files. Usually, the `pkl` files are located at `outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output` after running the transcription model as described [previously](#a-instrument-recognition--transcription).
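
Before converting, it can help to check what the `pkl` files contain. The sketch below makes no assumption about their internal structure, which is not documented here: `summarize_pkl` is a hypothetical helper that only reports the top-level type (and the keys, if it is a dict). Run it inside the `jointist` environment so that any classes stored in the files can be imported, and only unpickle files you produced yourself.

```python
import pickle
from pathlib import Path


def summarize_pkl(pkl_file):
    """Load one pickle file and return a short description of its top level.

    The structure of the transcription `pkl` files is not assumed here, so
    this only reports the type and, for dicts, the keys.
    Warning: unpickling can execute arbitrary code; use trusted files only.
    """
    with open(pkl_file, "rb") as f:
        obj = pickle.load(f)
    if isinstance(obj, dict):
        return f"{type(obj).__name__} with keys {sorted(map(str, obj))}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} of length {len(obj)}"
    return type(obj).__name__


if __name__ == "__main__":
    # Replace the timestamp placeholders with your own run folder.
    pkl_dir = Path("outputs/YYYY-MM-DD/HH-MM-SS/MIDI_output")
    for pkl_file in sorted(pkl_dir.glob("*.pkl")):
        print(pkl_file.name, "->", summarize_pkl(pkl_file))
```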

## Using individual pretrained models
### Transcription
```bash
python pred_transcription.py datamodule=wild
```

Currently supported `datamodule`:
1. wild
1. h5
1. slakh

The configuration for each datamodule, such as `path` and `audio_ext`, can be modified inside `End2End/config/datamodule/xxx.yaml`.

## Training the model
First, set up the Slakh2100 dataset by running:
```bash
bash slakh2100_dataprocessing.sh
```


### Instrument Recognition

```bash
python train_detection.py detection=Original datamodule=slakh detection/backbone=CNN14_less_pooling epochs=50 gpus=4 every_n_epochs=2  
```

1. `detection`: controls the model type.
1. `detection/backbone`: controls which CNN backbone to use.
1. `datamodule`: controls which dataset to use (`openmic2018` or `slakh`). It affects the instrument mappings.

Please refer to `End2End/config/detection_config.yaml` for more configuration parameters.

### Transcription

```bash
python train_transcription.py transcription.backend.acoustic.type=CNN8Dropout_Wide inst_sampler.mode=imbalance inst_sampler.samples=2 inst_sampler.neg_samples=2 inst_sampler.temp=0.5 inst_sampler.audio_noise=0 gpus=[0] batch_size=2
```

1. `transcription.backend.acoustic.type`: controls the model type.
1. `inst_sampler.mode=imbalance`: controls which sampling mode to use.
1. `inst_sampler.samples`: controls how many positive samples are mined for training.
1. `inst_sampler.neg_samples`: controls how many negative samples are mined for training.
1. `inst_sampler.temp`: sampling temperature, only effective when using imbalance sampling.
1. `inst_sampler.audio_noise`: controls whether random noise is added to the audio during training.
1. `gpus`: controls which GPUs to use. `[0]` means using cuda:0; `[2]` means using cuda:2; `[0,1,2,3]` means using four GPUs, cuda:0-3.

Please refer to `End2End/config/transcription_config.yaml` for more configuration parameters.

### End2end training (Jointist)

```bash
python train_jointist.py
```


## Experiments
The experiments are documented [here](./experiments.md).