# DiffRes: Learning differentiable temporal resolution on audio spectrogram

The open-source code is still under development. Please refer to this [anonymous link](https://anonymous.4open.science/r/differentiable_temporal_resolution_on_spectrogram-0EBD/README.md) for the most up-to-date code.

DiffRes is implemented as the Python class *Proposed* in **src/models/diffres.py**.


# Conda environment

Please ensure conda is properly installed.

Then create the environment with the following conda command.

```shell
# Create a new conda environment diffres
conda env create --name diffres --file=env.yml
```

Activate the environment:

```shell
conda activate diffres
```
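
Before launching an experiment, it can help to confirm that the environment is actually active. The sketch below relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` exports; the helper name `require_conda_env` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): succeed only if the
# named conda environment is the active one.
require_conda_env() {
  expected="$1"
  # conda activate exports CONDA_DEFAULT_ENV with the active environment name.
  if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
    echo "conda environment '$expected' is not active; run: conda activate $expected" >&2
    return 1
  fi
}
```

For example, `require_conda_env diffres && echo "environment ready"` prints the message only when the diffres environment is active.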

# Audio classification

Each dataset has an entry script (run.sh) that runs the experiment. Please ensure the dataset is ready before using run.sh (see the Dataset preparation section below). Every run.sh takes the same input arguments:

```shell
# <str, algorithm>: the dimension reduction method ["DoNothing","ChangeHopSize","AvgPool","ConvAvgPool", or "Proposed"]
# <float, dimension-reduction-rate>: the dimension reduction rate (delta).
# <int or float, hopsize_ms>: the hop size you would like to use when calculating mel-spectrogram.
source run.sh <str, algorithm> <float, dimension-reduction-rate> <int or float, hopsize_ms>
```
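
Because run.sh takes three positional arguments, a typo in any of them may only surface after the experiment has started. The following is a minimal sketch of a pre-flight check, not something the repository provides: the function name `check_run_args` is hypothetical, and the numeric ranges (a reduction rate in [0, 1) and a strictly positive hop size) are assumptions based on the example commands in this README.

```shell
# Hypothetical helper (not part of this repository): validate the three
# run.sh arguments before starting an experiment.
check_run_args() {
  algorithm="$1"; rate="$2"; hop_ms="$3"
  case "$algorithm" in
    DoNothing|ChangeHopSize|AvgPool|ConvAvgPool|Proposed) ;;
    *) echo "unknown algorithm: $algorithm" >&2; return 1 ;;
  esac
  # Both numeric arguments must look like plain non-negative decimals.
  for value in "$rate" "$hop_ms"; do
    printf '%s\n' "$value" | grep -Eq '^[0-9]+(\.[0-9]+)?$' || {
      echo "not a number: $value" >&2; return 1; }
  done
  # Assumption: the reduction rate is a fraction in [0, 1).
  awk -v r="$rate" 'BEGIN { exit !(r < 1) }' || {
    echo "reduction rate must be below 1: $rate" >&2; return 1; }
  # Assumption: the hop size must be strictly positive.
  awk -v h="$hop_ms" 'BEGIN { exit !(h > 0) }' || {
    echo "hop size must be positive: $hop_ms" >&2; return 1; }
}
```

A possible usage is `check_run_args Proposed 0.25 10 && source run.sh Proposed 0.25 10`, which only starts the run when all three arguments pass the check.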

**Take the SpeechCommands dataset as an example**

1. First, download and prepare the dataset metadata
```shell
cd egs/speechcommands
python3 prep_sc.py
```

2. After a while, the dataset will be ready. Then you can start running the experiments:

```shell
# Baseline
source run.sh DoNothing 0.0 10
# Change hop size with 25% temporal dimension reduction, 10 ms hop size
source run.sh ChangeHopSize 0.25 10
# AvgPool with 25% temporal dimension reduction, 10 ms hop size
source run.sh AvgPool 0.25 10
# ConvAvgPool with 25% temporal dimension reduction, 10 ms hop size
source run.sh ConvAvgPool 0.25 10
# run the mel-spectrogram (10 ms hop size) based DiffRes algorithm with 25% temporal dimension reduction.
source run.sh Proposed 0.25 10
# run the mel-spectrogram (3 ms hop size) based DiffRes algorithm with 66% temporal dimension reduction.
source run.sh Proposed 0.66 3
```
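
To compare all reduction methods at the same setting, the commands above can be generated in a loop. This is a sketch only: the function name `print_sweep` is hypothetical, and it prints the commands instead of running them so the list can be reviewed first.

```shell
# Hypothetical helper (not part of this repository): print one run.sh
# command per reduction method for a given rate and hop size.
print_sweep() {
  rate="$1"; hop_ms="$2"
  for algorithm in ChangeHopSize AvgPool ConvAvgPool Proposed; do
    echo "source run.sh $algorithm $rate $hop_ms"
  done
}

# Print the four commands for 25% reduction at a 10 ms hop size.
print_sweep 0.25 10
```

Once the printed list looks right, replace the `echo` line with the command itself to run the sweep from the dataset directory.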

# Dataset preparation

The directory for each dataset:
```shell
# AudioSet tagging 
cd egs/audioset
# ESC50 classification
cd egs/esc50
# FSD50K tagging
cd egs/fsd50k
# SpeechCommands classification
cd egs/speechcommands
# Music instrument classification on NSynth
cd egs/nsynth_instrument
```

Each directory contains a run.sh file, but you need to prepare the dataset before running it. The metadata is organized similarly across datasets.
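
A quick way to confirm the checkout is complete is to look for the entry script in each of the directories listed above. The helper below is a hypothetical sketch (the name `missing_entry_scripts` is not from this repository); it takes the repository root as its only argument and prints every dataset directory that lacks a run.sh.

```shell
# Hypothetical helper (not part of this repository): list the dataset
# directories under <root>/egs that do not contain a run.sh file.
missing_entry_scripts() {
  root="${1:-.}"
  for name in audioset esc50 fsd50k speechcommands nsynth_instrument; do
    [ -f "$root/egs/$name/run.sh" ] || echo "egs/$name"
  done
}
```

Running `missing_entry_scripts .` from the repository root prints nothing when every dataset directory has its entry script.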

## AudioSet tagging

Please refer to [this repo](https://github.com/YuanGongND/psla) to prepare the AudioSet metadata json file.

## FSD50K tagging

Please run the following script to prepare the FSD50K dataset:
```shell
cd egs/fsd50k/
python3 prep_fsd.py
```

## SpeechCommands

Please run the following script to prepare the SpeechCommands dataset:
```shell
cd egs/speechcommands
python3 prep_sc.py
```

## ESC50
Please run the following script to prepare the ESC50 dataset:
```shell
cd egs/esc50
python3 prep_esc50.py
```

## Music instrument classification on NSynth
Please run the following script to prepare the NSynth dataset:
```shell
cd egs/nsynth_instrument
python3 prep_nsynth_instrument.py
```


Note: This code repo borrows its structure from [this open-sourced repo](https://github.com/YuanGongND/psla).