
# Towards Fine-grained Audio Captioning with Multimodal Contextual Cues 





## 🚀 Quick Start

### Environment Setup

```bash
# Create conda environment
conda create -n FusionAudio python=3.10
conda activate FusionAudio

# Install dependencies
pip install -r requirements.txt
pip install -e src/GAMA/hf-dev-train/transformers-main
pip install -e src/GAMA/peft-main
```

### Quick Inference

We provide an easy-to-use inference script `quick_inference.py` that supports both command-line and Python API usage.

#### Command Line Usage

```bash
python quick_inference.py \
    --base_model /path/to/Llama-2-7b-chat-hf-qformer \
    --model_path /path/to/fusionaudio_checkpoint.pth \
    --audio /path/to/your/audio.wav \
    --question "Please describe this audio in detail."
```

#### Python API Usage

```python
from quick_inference import FusionAudioInference

# Initialize inferencer
inferencer = FusionAudioInference(
    base_model_path="/path/to/Llama-2-7b-chat-hf-qformer",
    model_path="/path/to/fusionaudio_checkpoint.pth",
    device="cuda:0"
)

# Audio captioning
response = inferencer.predict(
    audio_path="/path/to/your/audio.wav",
    question="Please describe this audio in detail."
)
print(f"Audio description: {response}")
```

For detailed parameter descriptions, run `python quick_inference.py --help`.


## 🏋️ Training

### Preprocessing

1. Download Llama-2-7b-chat-hf-qformer model (refer to [GAMA README](https://github.com/Sreyan88/GAMA))
2. Update the model path in `src/GAMA/gama_finetune.py` at lines 96 and 101

### Start Training

```bash
conda activate FusionAudio
cd scripts/train/
bash train.sh
```

## 📈 Evaluation

### Classification Task Evaluation

```bash
cd scripts/eval
bash eval_cls.sh
```

### Captioning Evaluation

```bash
cd scripts/eval  
bash infer.sh
```

### Retrieval Task Evaluation

```bash
# Environment preparation (refer to WavCaps repository)
# 1. Configure environment according to https://github.com/XinhaoMei/WavCaps/tree/master/retrieval
# 2. Set ckpt_path in inference.yaml
# 3. Put eval_retrieval.py into the downloaded retrieval folder

cd scripts
python eval_retrieval.py
```

## 📋 Data Statistics

![statistics](imgs/statistics.png)

