# Getting Started

Below we provide instructions for training and inference on audio and vision-language tasks.

We recommend organizing your workspace directory as follows:
```
ONE-PEACE/
├── assets/
├── fairseq/
├── one_peace/
│   ├── checkpoints
│   │   ├── one-peace.pt
│   ├── criterions
│   ├── data
│   ├── dataset
│   │   ├── esc50/
│   │   ├── flickr30k/
│   ├── metrics
│   └── ...
├── README.md
├── requirements.txt
```
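
Before launching any script, it can help to confirm that the workspace matches the layout above. The following is a minimal sketch, not part of the repository: `check_workspace` is a hypothetical helper, and the list of entries it checks is just a subset of the tree shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with ONE-PEACE): report which of the
# expected workspace entries are missing under the given root directory.
# Usage: check_workspace /path/to/ONE-PEACE
check_workspace() {
  local root="${1:-.}" missing=0 entry
  for entry in \
    fairseq \
    one_peace/checkpoints/one-peace.pt \
    one_peace/dataset \
    requirements.txt
  do
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $entry"
      missing=1
    fi
  done
  # Returns 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

Running `check_workspace .` from inside `ONE-PEACE/` prints one `missing: ...` line per absent entry and returns a non-zero status if anything is missing.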

**Please note that if your device does not support bf16 precision, you can switch to fp16 precision for fine-tuning or inference.**
```yaml
common:
  # # use bf16
  # fp16: false
  # memory_efficient_fp16: false
  # bf16: true
  # memory_efficient_bf16: true

  # use fp16
  fp16: true
  memory_efficient_fp16: true
  bf16: false
  memory_efficient_bf16: false
```
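
If you are unsure which precision your device supports, the sketch below may help on NVIDIA hardware. It assumes that native bf16 is available on GPUs with compute capability 8.0 (Ampere) or newer; `precision_for_compute_cap` is a hypothetical helper, not part of this repository, and other accelerators need a different check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an NVIDIA compute capability (e.g. "8.0") to a
# suggested precision. Assumption: bf16 is supported natively from 8.0 upward.
precision_for_compute_cap() {
  local major="${1%%.*}"   # keep only the part before the first dot
  case "$major" in
    ''|*[!0-9]*)           # empty or non-numeric input
      echo unknown
      return 1
      ;;
  esac
  if [ "$major" -ge 8 ]; then
    echo bf16
  else
    echo fp16
  fi
}
```

With a reasonably recent driver, the compute capability can be read via `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, and its first line passed to the helper. If the result is `fp16`, use the fp16 settings shown above.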
<br>

## Pretraining
The overall pretraining process of ONE-PEACE is divided into two stages: vision-language pretraining and audio-language pretraining.

### Vision-Language Pretraining (Stage1 Pretraining)
Here we provide an example of vision-language pretraining.
1. **Download COCO.** You can also replace COCO with your own datasets.
2. **Pretraining**
```bash
cd one_peace/run_scripts/pretrain
bash pretrain_vl_3B.sh
```

### Audio-Language Pretraining (Stage2 Pretraining)
In the audio-language pretraining stage, we initialize the model with the checkpoint from vision-language pretraining and train it on audio-text pairs.
1. **Download AudioCaps, Clotho and MACS.** You can also prepare your own datasets.
2. **Pretraining.** Remember to load the checkpoint produced by vision-language pretraining.
```bash
cd one_peace/run_scripts/pretrain
bash pretrain_al_3B.sh
```
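
Because stage 2 depends on the stage 1 checkpoint, a small guard can catch a missing file before the job starts. This is only a sketch: `require_checkpoint` and `STAGE1_CKPT` are hypothetical names, and how the checkpoint path is actually passed to the model is determined inside `pretrain_al_3B.sh`, which this example does not modify.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail early if the stage 1 checkpoint is missing or empty.
require_checkpoint() {
  local ckpt="$1"
  if [ ! -s "$ckpt" ]; then   # -s: file exists and has a size greater than zero
    echo "checkpoint not found or empty: $ckpt" >&2
    return 1
  fi
  echo "using checkpoint: $ckpt"
}

# Example (STAGE1_CKPT is a placeholder; point it at your own stage 1 output):
# require_checkpoint "$STAGE1_CKPT" && bash pretrain_al_3B.sh
```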
<br>

## Finetuning and Inference
### ESC-50
1. Download ESC-50
2. Inference (zero-shot evaluation)
```bash
cd one_peace/run_scripts/esc50
bash zero_shot_evaluate.sh
```

### Image-Text Retrieval
1. Download COCO and Flickr30K
2. Finetuning
```bash
cd one_peace/run_scripts/image_text_retrieval
bash finetune_coco.sh
bash finetune_flickr.sh
```
3. Inference
```bash
cd one_peace/run_scripts/image_text_retrieval
bash zero_shot_evaluate_coco.sh  # zero-shot retrieval for COCO
bash zero_shot_evaluate_flickr.sh  # zero-shot retrieval for Flickr30K
bash evaluate_coco.sh  # evaluation for COCO
bash evaluate_flickr.sh  # evaluation for Flickr30K
```

### NLVR2
1. Download NLVR2
2. Finetuning
```bash
cd one_peace/run_scripts/nlvr2
bash finetune.sh
```
3. Inference
```bash
cd one_peace/run_scripts/nlvr2
bash evaluate.sh
```

### Visual Grounding
1. Download RefCOCO, RefCOCO+ and RefCOCOg
2. Finetuning
```bash
cd one_peace/run_scripts/visual_grounding
bash finetune_refcoco.sh
bash finetune_refcoco+.sh
bash finetune_refcocog.sh
```
3. Inference
```bash
cd one_peace/run_scripts/visual_grounding
bash evaluate_refcoco.sh  # evaluation for RefCOCO
bash evaluate_refcoco+.sh  # evaluation for RefCOCO+
bash evaluate_refcocog.sh  # evaluation for RefCOCOg
```

### VQA
1. Download VQAv2
2. Finetuning
```bash
cd one_peace/run_scripts/vqa
bash finetune.sh
```
3. Inference
```bash
cd one_peace/run_scripts/vqa
bash evaluate.sh
```

### Audio-Text Retrieval
1. Download AudioCaps, Clotho and MACS
2. Finetuning
```bash
cd one_peace/run_scripts/audio_text_retrieval
bash finetune.sh
```
3. Inference
```bash
cd one_peace/run_scripts/audio_text_retrieval
bash evaluate.sh
```

### Audio Question Answering (AQA)
1. Download AQA
2. Finetuning
```bash
cd one_peace/run_scripts/aqa
bash finetune.sh
```
3. Inference
```bash
cd one_peace/run_scripts/aqa
bash evaluate.sh
```

### FSD50K
1. Download FSD50K
2. Finetuning
```bash
cd one_peace/run_scripts/fsd50k
bash finetune.sh
```
3. Inference
```bash
cd one_peace/run_scripts/fsd50k
bash evaluate.sh
```

### VGGSound
1. Download VGGSound
2. Finetuning
```bash
cd one_peace/run_scripts/vggsound
bash finetune.sh
```
3. Inference
```bash
cd one_peace/run_scripts/vggsound
bash evaluate.sh
```
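
Several tasks above (NLVR2, VQA, Audio-Text Retrieval, AQA, FSD50K, VGGSound) share the same `finetune.sh` then `evaluate.sh` pattern, so they can be chained with a small wrapper. The sketch below is an assumption-laden convenience, not part of the repository: `run_task` and `DRY_RUN` are hypothetical names, it must be run from the `ONE-PEACE/` root, and it assumes each `evaluate.sh` already points at the checkpoint written by the corresponding `finetune.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run finetune.sh and then evaluate.sh for one task
# directory under one_peace/run_scripts/.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_task() {
  local task="$1"
  local dir="one_peace/run_scripts/$task"
  local script
  for script in finetune.sh evaluate.sh; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "(cd $dir && bash $script)"
    else
      # Run in a subshell so the caller's working directory is unchanged;
      # stop at the first failing script.
      (cd "$dir" && bash "$script") || return 1
    fi
  done
}

# Example: preview the commands for two tasks without running them.
# for task in nlvr2 vqa; do DRY_RUN=1 run_task "$task"; done
```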



