# Naming-Biases (SaMyNa) Project -- Bias Exemplars Mining

Source code for the ICLR 2025 submission with Submission ID 11221

## Hardware and Software Requirements

Software:
- The operating system should be a Linux distribution (we tested on Ubuntu 20.04.6 LTS and Red Hat Enterprise Linux release 8.7)
- Python 3.11 is recommended; 3.10 should suffice, but we do not guarantee it
- CUDA 12.1 or greater (we tested on 12.1; 11.8 may work, but we have not tested it)
- An NVIDIA driver compatible with the CUDA version you are using
- venv+pip
    - alternatively, you can use conda, but instructions are provided only for venv+pip.

Compute time:
- The longest training run is for CelebA, which can take several hours on an NVIDIA RTX 5000

## Setup

First, set up a venv environment and install the dependencies with pip:

```
python3 -m venv <env_path>
source <env_path>/bin/activate
pip install -r requirements.txt
```

Next, add the following environment variable to your `.bashrc` (or equivalent file), so that the output of the bias mining step is copied to the folder expected by the bias-naming step. The same variable is also used by the second step:

```
# WARNING: change the path to a location that has at least 256 GB of free disk space, use an absolute path

echo "export NAMING_BIASES_DATA_PATH=/path/to/data/naming-biases" >> ~/.bashrc
```
Now, to load the environment variable and create the required folders, run:
```
source ~/.bashrc                       # or similar, like .zshrc if you use zsh
mkdir -p "$NAMING_BIASES_DATA_PATH/datasets"   # creates both folders; safe to re-run
```
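
Before launching a run, it can help to confirm that the variable and folders are in place. The following is a minimal sketch and not part of the repository: the helper name `check_data_path` is hypothetical, and the 256 GB threshold simply mirrors the warning above.

```python
import os
import shutil
from pathlib import Path

REQUIRED_FREE_GB = 256  # mirrors the free-space warning above


def check_data_path(env=None, required_gb=REQUIRED_FREE_GB):
    """Return a list of problems with NAMING_BIASES_DATA_PATH (empty if none)."""
    env = os.environ if env is None else env
    raw = env.get("NAMING_BIASES_DATA_PATH")
    if not raw:
        return ["NAMING_BIASES_DATA_PATH is not set"]

    problems = []
    root = Path(raw)
    if not root.is_absolute():
        problems.append(f"{raw} is not an absolute path")
    if not (root / "datasets").is_dir():
        problems.append(f"{root / 'datasets'} does not exist")
    else:
        # Free space is measured on the filesystem that holds the data folder.
        free_gb = shutil.disk_usage(root).free / 1e9
        if free_gb < required_gb:
            problems.append(f"only {free_gb:.0f} GB free, {required_gb} GB recommended")
    return problems


if __name__ == "__main__":
    for problem in check_data_path():
        print("WARNING:", problem)
```

An empty list means the three checks (variable set, absolute path, `datasets` folder present with enough free space) all passed.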

## Output of the bias mining step

The bias exemplar extraction step runs independently of the rest of the pipeline. Its results (txt files with IDs, and the images to be captioned) are written to a folder named `medoid_results` in the main directory of this project. Additionally, we support logging the output to Weights & Biases as a zip artifact, which is also stored locally by default. To facilitate reproducing the results, we also copy the output to the folder `$NAMING_BIASES_DATA_PATH/datasets`.

##### Warning
If you encounter any problem due to the environment variable, you can comment out line 20 in `extract_bias_exemplars.py` and unzip the generated archive into the right folder, following the instructions for the second step.
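
If you take the manual route, the unpacking can be sketched in Python as follows. This is an illustration only: `unpack_results` is a hypothetical helper, the archive path is whatever the script produced on your machine, and `$NAMING_BIASES_DATA_PATH/datasets` is assumed to be the target because that is where the output is normally copied. Defer to the instructions for the second step if they differ.

```python
import os
import zipfile
from pathlib import Path


def unpack_results(archive, data_root=None):
    """Extract a results archive into <data_root>/datasets and return that folder."""
    # Fall back to the environment variable when no root is passed explicitly.
    data_root = Path(data_root or os.environ["NAMING_BIASES_DATA_PATH"])
    target = data_root / "datasets"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Passing `data_root` explicitly sidesteps the environment variable entirely, which is the point of the workaround.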

## Available Datasets
We provide automatic download of the required datasets through their implementations in the folder `datasets`, apart from the validation set of ImageNet-1K (required for the ImageNet-A experiments), whose download requires authentication on either the official ImageNet website or an external provider such as Hugging Face.

## Available Commands
The main script for all the available experiments is `extract_bias_exemplars.py`, which can be configured through command line arguments:
```
usage: extract_bias_exemplars.py [-h] --dataset DATASET [--use_wb USE_WB] [--retrain RETRAIN] [--model MODEL] [--evaluate_test EVALUATE_TEST]

options:
  -h, --help                    show this help message and exit
  --dataset DATASET             dataset name. choose in [waterbirds, bar, celeba, imagenet-a]
  --use_wb USE_WB               whether to use weights and biases logging or not, default=true
  --retrain RETRAIN             repeat experiment and overwrite vanilla model, default=false
  --model MODEL                 which model to use, default=resnet50. vitb16 available for Waterbirds and Imagenet-A, swinv2b available for ImageNet-A
  --evaluate_test EVALUATE_TEST run model in inference against misaligned samples and extract exemplars, default=false
  --ablation_on_k ABLATE_ON_K   Ablation study on K for Waterbirds and ResNet-50, default=false, if set to true overrides other arguments
```
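
Note that the boolean options take an explicit `true` or `false` value rather than acting as bare switches. A minimal sketch of a parser with this behavior is shown below. It is a hypothetical reconstruction from the help text above, not the script's actual code; `str2bool` and `build_parser` are illustrative names.

```python
import argparse


def str2bool(value):
    """Parse 'true'/'false' (case-insensitive) into a bool."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser():
    # Hypothetical reconstruction from the help text; not the script's real parser.
    parser = argparse.ArgumentParser(prog="extract_bias_exemplars.py")
    parser.add_argument("--dataset", required=True,
                        choices=["waterbirds", "bar", "celeba", "imagenet-a"])
    parser.add_argument("--use_wb", type=str2bool, default=True)
    parser.add_argument("--retrain", type=str2bool, default=False)
    parser.add_argument("--model", default="resnet50")
    parser.add_argument("--evaluate_test", type=str2bool, default=False)
    parser.add_argument("--ablation_on_k", type=str2bool, default=False)
    return parser


args = build_parser().parse_args(["--dataset", "waterbirds", "--use_wb", "false"])
print(args.dataset, args.use_wb, args.model)  # prints: waterbirds False resnet50
```
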
### Example commands

+ ```./extract_bias_exemplars.py --dataset waterbirds --model resnet50 --use_wb true```  
    This command runs exemplars extraction on Waterbirds with a ResNet50, as described in Section 3.1 of the main paper
+ ```./extract_bias_exemplars.py --dataset waterbirds --model resnet50 --use_wb true --evaluate_test true```  
    Runs the vanilla model in inference and provides separate accuracies for aligned and conflicting samples
+ ```./extract_bias_exemplars.py --dataset imagenet-a --model resnet50 --use_wb true```  
    This command runs inference-time bias exemplar extraction on the full ImageNet-A with a ResNet50, as described in Section 3.1 of the main paper. The classes we analyzed in the main paper (_insects-on-hand_) correspond to classes 124, 306, 313, and 314 of ImageNet. We filter the output directly to these 4 classes to make it easier to handle. To disable this filter, comment out lines 520 to 529 in `extract_bias_exemplars.py`. Note that the unfiltered output is intended for visual inspection and is not compatible with the second step
+ ```./extract_bias_exemplars.py --dataset imagenet-a --model vitb16 --use_wb true```  
     This command runs inference-time bias exemplar extraction on the full ImageNet-A with a ViT-B16, as described in the supplementary materials
+ ```./extract_bias_exemplars.py --dataset imagenet-a --model swinv2b --use_wb true```  
     This command runs inference-time bias exemplar extraction on the full ImageNet-A with a Swin-V2-B, as described in the supplementary materials
+ ```./extract_bias_exemplars.py --dataset waterbirds --ablation_on_k true```  
    This command runs on Waterbirds with several values of K for K-medoids: [1, 5, 10, 25, 50]. Only one vanilla model is trained
+ ```./extract_bias_exemplars.py --dataset <dataset_name> --use_wb false```  
    This command runs on `<dataset_name>` with its default model, without Weights & Biases logging
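
To make the K ablation above concrete, here is a toy sketch of sweeping K over a fixed set of features with a plain alternating K-medoids. It is an assumption-laden illustration, not the repository's implementation: the `k_medoids` function, its deterministic initialization, the Euclidean distance, and the toy `features` are all hypothetical stand-ins.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Alternating K-medoids; returns the sorted indices of the chosen medoids."""
    medoids = list(range(k))  # deterministic init: the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, point in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(point, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        updated = []
        for m, members in clusters.items():
            if not members:  # can only happen with duplicate points
                updated.append(m)
                continue
            updated.append(min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            ))
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids)


# Toy stand-in for the features produced by the single trained model.
features = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]

# Sweep K over the ablation values, skipping any K larger than the toy data.
ablation = {k: k_medoids(features, k) for k in (1, 5, 10, 25, 50) if k <= len(features)}
print(k_medoids(features, 2))  # prints [1, 4]
```

Because the features are computed once and only K changes, the sweep never needs to retrain the model, which matches the note that only one vanilla model is trained.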

## Folders and Files
+ data: contains saved models and dataset files
+ datasets: dataset implementations
+ medoids_results: stores the raw output folders. Not present at first, it is generated automatically
+ wandb_wrapper.py: utility for wandb
+ vanilla_builder.py: factory for datasets and models
+ ```extract_bias_exemplars.py```: **main script**

## Additional Instructions for the Experiment on ImageNet-A
As ImageNet-1K is not directly available through a public link, you must obtain the validation set (which is sufficient; you do not need the training set) from other sources, such as the official website or external providers requiring authentication like Hugging Face. Once you have the tarball of the validation set, extract it into the `data` folder, inside a directory called ```ILSVRC2012_val_images```. The path of the images relative to the main project folder must therefore be ```./data/ILSVRC2012_val_images```. Everything else is handled automatically.

The classes we analyzed in the main paper (_insects-on-hand_) correspond to classes 124, 306, 313, and 314 of ImageNet. We filter the output directly to these 4 classes to make it easier to handle. To disable this filter, comment out lines 520 to 529 in `extract_bias_exemplars.py`. Note that the unfiltered output is intended for visual inspection and is not compatible with the second step.
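
The extraction of the validation tarball can be sketched in Python as follows. This is a minimal example under stated assumptions: `extract_validation_set` is a hypothetical helper, the tarball path is whatever file your provider gave you, and the images are assumed to sit at the top level of the archive.

```python
import tarfile
from pathlib import Path


def extract_validation_set(tarball, project_root="."):
    """Extract the validation tarball into <project_root>/data/ILSVRC2012_val_images."""
    target = Path(project_root) / "data" / "ILSVRC2012_val_images"
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases support extraction filters; "data" blocks unsafe members.
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return target
```

Run it from the main project folder (or pass `project_root`) so that the images end up under `./data/ILSVRC2012_val_images`.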


## Mitigating Discovered Bias 
The Bias Mitigation step described in Section 4.4 of the main paper can be executed in two simple steps:
1. ```clip_pseudolabeling --dataset <dataset_name>```
    + Available datasets are Waterbirds, CelebA (Hair Color), and BAR
2. Run the script for your target dataset, which we provide in the main folder of the repository:
    + ```bash waterbirds_samyna.sh```
    + ```bash celeba_samyna.sh```
    + ```bash bar_samyna.sh```

In the bias pseudo-labeling step we already employ the discovered bias keywords, so you do not need to run the Keyword Extraction procedure (Section 3.2) before this step.
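
The two mitigation steps can be chained from Python, as in the sketch below. It only builds and prints the commands. The script names come from the list above, while the `mitigation_commands` helper and the lowercase dataset keys passed to `--dataset` are assumptions that may need adjusting.

```python
import shlex

# Script names are taken from the list above; the lowercase keys are an assumption.
MITIGATION_SCRIPTS = {
    "waterbirds": "waterbirds_samyna.sh",
    "celeba": "celeba_samyna.sh",
    "bar": "bar_samyna.sh",
}


def mitigation_commands(dataset):
    """Return the two commands of the mitigation step, in execution order."""
    if dataset not in MITIGATION_SCRIPTS:
        raise ValueError(f"no mitigation script for dataset {dataset!r}")
    return [
        ["clip_pseudolabeling", "--dataset", dataset],
        ["bash", MITIGATION_SCRIPTS[dataset]],
    ]


for command in mitigation_commands("waterbirds"):
    print(shlex.join(command))
```

To actually execute them, each list can be passed to `subprocess.run(command, check=True)` from the main folder of the repository, so that the second step runs only if the first succeeds.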
