# What is _Dolph2Vec_?

_Dolph2Vec_ is the first self-supervised model pre-trained exclusively on dolphin vocalizations. It adapts the wav2vec 2.0 architecture to a custom large-scale dataset of dolphin sounds.

This document provides instructions to reproduce the results reported in the Dolph2Vec paper, including signature whistle classification and dolphin call detection using logistic regression on various feature representations.

## Environment Setup

Install required packages:

```bash
pip install -r requirements.txt
```

## Batch Evaluation: Classification and Detection

To replicate the full results table, which covers classification and detection for every model and regularization value, run:

```bash
bash src/run_all.sh
```

This writes a text file with the evaluation metrics:

- mean Accuracy and standard deviation (for classification)
- mean Average Precision (mAP) and standard deviation (for detection)
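
As a rough illustration of how these summaries are commonly computed, the sketch below scores one binary label with average precision and aggregates per-fold scores into a mean and standard deviation. The function names are hypothetical, tied scores are not treated specially, and the use of the population standard deviation is an assumption; the repository's scripts may differ on each point. mAP is then the mean of the per-class average precision values.

```python
from statistics import mean, pstdev


def average_precision(labels, scores):
    """Average precision for one binary label.

    Items are ranked by descending score; the result is the mean of the
    precision values measured at the rank of each positive item.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0


def summarize(fold_scores):
    """Mean and population standard deviation over per-fold scores (assumed)."""
    return mean(fold_scores), pstdev(fold_scores)


if __name__ == "__main__":
    ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
    print(f"AP = {ap:.4f}")
    fold_mean, fold_std = summarize([0.8, 0.9, 1.0])
    print(f"mean = {fold_mean:.4f}, std = {fold_std:.4f}")
```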

The `run_all.sh` script contains the following:

```bash
#!/bin/bash

models=("dolph2vec" "aves_core" "aves_bio" "biolingual" "mfcc" "spectrogram" "spectral_features")
inverse_regs=(0.1 1.0 10.0)
datasets=("detection" "dolphin_reef_unbalanced")

for model in "${models[@]}"; do
  for dataset in "${datasets[@]}"; do
    for reg in "${inverse_regs[@]}"; do
      echo "Running model: $model with inverse_reg: $reg on dataset $dataset"
      python train_lr_kfold.py --model "$model" --inverse_reg "$reg" --dataset_name "$dataset"
      echo "===================================================="
    done
  done
done
```
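
The three nested loops expand to 7 × 2 × 3 = 42 runs. For a dry run, or to hand the jobs to a scheduler, the same grid can be enumerated in Python. This is only a sketch: `build_commands` is a hypothetical helper, and the script name and flags are copied from the bash script above.

```python
from itertools import product

MODELS = ["dolph2vec", "aves_core", "aves_bio", "biolingual",
          "mfcc", "spectrogram", "spectral_features"]
INVERSE_REGS = [0.1, 1.0, 10.0]
DATASETS = ["detection", "dolphin_reef_unbalanced"]


def build_commands():
    """Return one argv list per (model, dataset, inverse_reg) combination.

    The iteration order matches the bash loops: model, then dataset, then reg.
    """
    return [
        ["python", "train_lr_kfold.py",
         "--model", model,
         "--inverse_reg", str(reg),
         "--dataset_name", dataset]
        for model, dataset, reg in product(MODELS, DATASETS, INVERSE_REGS)
    ]


if __name__ == "__main__":
    commands = build_commands()
    print(f"{len(commands)} runs")
    for argv in commands:
        print(" ".join(argv))
```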

The downstream tasks are defined as follows:

- `dolphin_reef_unbalanced`: classification of signature whistles
- `detection`: detection of a whistle and, if one is present, classification of the signature whistle


### Single-Run Examples

To obtain a single score for either classification or detection, for a single model on a given dataset, follow the examples below.

#### Example 1: Classification with MFCC baseline:

```bash
python run_lr_logreg.py \
  --seed 123 \
  --inverse_reg 1.0 \
  --kfold 5 \
  --normalize_data \
  --dataset_name dolphin_reef_unbalanced \
  --model mfcc
```


#### Example 2: Detection with Dolph2Vec:

```bash
python run_lr_logreg.py \
  --seed 123 \
  --inverse_reg 1.0 \
  --kfold 5 \
  --dataset_name detection \
  --model dolph2vec
```


Note that `--normalize_data` is required only for the acoustic baselines (`mfcc`, `spectrogram`, `spectral_features`).

All models use 5-fold cross-validation unless otherwise specified.
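
To make the roles of `--seed`, `--kfold`, and `--normalize_data` concrete, here is a minimal sketch of a seeded, shuffled k-fold split with feature standardization fitted on the training fold only. It is not the repository's implementation: the helper names are hypothetical, and both z-score normalization and unstratified folds are assumptions.

```python
import random
from statistics import mean, pstdev


def kfold_indices(n_samples, k=5, seed=123):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_standardizer(rows):
    """Per-feature mean and std, computed from the training rows only."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return means, stds


def standardize(rows, means, stds):
    """Apply z-score normalization with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


if __name__ == "__main__":
    features = [[float(i), float(i % 3)] for i in range(10)]
    for train_idx, test_idx in kfold_indices(len(features), k=5, seed=123):
        means, stds = fit_standardizer([features[i] for i in train_idx])
        test_rows = standardize([features[i] for i in test_idx], means, stds)
        print(len(train_idx), len(test_idx))
```

Fitting the statistics on the training fold alone keeps information from the held-out fold out of the preprocessing step.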


## GMM Clustering, RSA and UMAP visualization of embeddings

To reproduce the GMM clustering and UMAP plots from Section 4.4 of the paper, follow the steps below.


### First save embeddings for all models

Run:

```bash
bash src/extract_all.sh
```

This creates an `embeddings` folder containing the output.

Then run the clustering analysis:

```bash
python clustering_analysis.py
```

and for the RSA analysis:

```bash
python rsa.py
```
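
For readers unfamiliar with representational similarity analysis (RSA), the sketch below shows the general idea: build a dissimilarity matrix over the same items for each model, then correlate the upper triangles of the two matrices. This is not taken from `rsa.py`; the function names are hypothetical, and cosine distance and Pearson correlation are assumptions chosen for brevity (rank correlation is a common alternative).

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rdm_upper(embeddings):
    """Upper triangle of the representational dissimilarity matrix."""
    return [cosine_distance(embeddings[i], embeddings[j])
            for i, j in combinations(range(len(embeddings)), 2)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(emb_a, emb_b):
    """Correlate the dissimilarity structure of two embedding sets.

    Both sets must describe the same items in the same order.
    """
    return pearson(rdm_upper(emb_a), rdm_upper(emb_b))


if __name__ == "__main__":
    model_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    model_b = [[2.0, 0.0], [0.0, 3.0], [4.0, 4.0]]  # same directions, rescaled
    print(rsa_score(model_a, model_b))
```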


## Quantized Representation Analysis and Visualization

To reproduce the codebook visualizations and analyses from Section 4.5 of the paper, run:

```bash
python extract_quantized.py \
  --seed 42 \
  --dataset_name dolphin_reef_balanced \
  --model custom_quant \
  --outfolder /path/to/output \
```

This extracts the quantized representations from _Dolph2Vec_, plots co-occurrence matrices, and saves the information metrics (entropy, mutual information) and UMAP plots of the quantized latents in the specified `--outfolder`.
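
As a reference for the two information metrics, here is a minimal sketch that computes them from sequences of discrete code indices. It assumes the inputs are per-frame code indices, for example from two codebook groups aligned in time; the function names are hypothetical and the script's own estimators may differ.

```python
import math
from collections import Counter


def entropy(codes):
    """Shannon entropy (in bits) of the empirical code distribution."""
    counts = Counter(codes)
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(codes_a, codes_b):
    """Mutual information (in bits) between two aligned code sequences."""
    n = len(codes_a)
    joint = Counter(zip(codes_a, codes_b))  # co-occurrence counts
    marg_a, marg_b = Counter(codes_a), Counter(codes_b)
    return sum(
        (c / n) * math.log2((c / n) / ((marg_a[a] / n) * (marg_b[b] / n)))
        for (a, b), c in joint.items()
    )


if __name__ == "__main__":
    group_1 = [0, 0, 1, 1]
    group_2 = [0, 1, 0, 1]
    print(entropy(group_1))
    print(mutual_information(group_1, group_1))
    print(mutual_information(group_1, group_2))
```

The `joint` counter is the same quantity a co-occurrence matrix tabulates, so both metrics can be derived from it.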


### Which datasets and models can I choose?

| Argument         | Choices                                                                 |
|------------------|-------------------------------------------------------------------------|
| `--dataset_name` | `dolphin_reef_balanced`  <br> `dolphin_reef_unbalanced` <br> `detection` <br> `binary_detection`|
| `--model`        | `dolph2vec` <br> `dolph2vec-shuffle` <br> `aves_core` <br> `aves_bio` <br> `biolingual` <br> `mfcc` <br> `spectrogram` <br> `spectral_features` |
