# Code Structure and Usage Instructions

## Training Code (/train)

### Overview
The training code is adapted from the LiLT codebase (https://github.com/codezakh/LilT). It trains a model on the combined datasets listed below, using the `dinov2-arl-wds-combined` configuration.

### Setup Instructions
1. Download the following datasets into `./image_data` using img2dataset (https://github.com/rom1504/img2dataset):
   - cc3m
   - cc12m
   - sbu
   - laion_class_collected_6m
   - ImageNet validation set

2. Configure the paths in the `dinov2-arl-wds-combined.yaml` file.

3. Set up the environment:
   - Use the `environment.yaml` file to create the conda environment
   - Run `bash extra_install.sh` for additional setup

### Training Command
Use the following command to start training:

```bash
python -m torch.distributed.launch --master_port=43770 --nproc_per_node=8 --use_env PretrainHydra.py --config dinov2-arl-wds-combined --output_dir ./storage/lilt_cache/dinov2-arl-local-vis-patch-local-text-patch-text-mlp-textpool-mean-imageprotos-cc3m-cc12m-sbu-16k_bs/ --overrides +save_last_only=False fp16=True disable_wandb=False text_pooling=mean local_vision_projection=patch local_text_projection=patch text_projection=mlp
```

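This command launches 8 processes on a single node (`--nproc_per_node=8`), loads the `dinov2-arl-wds-combined` configuration, writes results to the directory given by `--output_dir`, and applies the `key=value` pairs listed after `--overrides` on top of the configuration. Training takes 50 hours on 8 A100 GPUs.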
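
The tokens after `--overrides` look like Hydra-style overrides, where a leading `+` adds a key that is not present in the base configuration. The sketch below shows one way such tokens could be interpreted. The helpers `parse_overrides` and `convert` are hypothetical and are not taken from `PretrainHydra.py`, which may handle overrides differently.

```python
def convert(raw):
    """Best-effort conversion of an override value to bool, int, float, or str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_overrides(tokens):
    """Split override tokens into (updated, added) dictionaries.

    Hypothetical helper: keys prefixed with '+' are treated as new keys,
    all other keys as updates to existing configuration entries.
    """
    updated, added = {}, {}
    for token in tokens:
        key, sep, raw = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        target = updated
        if key.startswith("+"):
            key, target = key[1:], added
        target[key] = convert(raw)
    return updated, added


updated, added = parse_overrides(
    ["+save_last_only=False", "fp16=True", "text_pooling=mean"]
)
print("added:", added)
print("updated:", updated)
```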

## Data Collection Code (/collection)

### Overview
The data collection process involves several steps to process the LAION dataset, compute embeddings, calculate similarity scores, and collect the best samples.

### Steps and Usage

1. **Download LAION Parquets**
   - Download all LAION parquets to `/laion400m-meta`

2. **Compute Embeddings (getting_laion_embeds.py)**
   - Usage: `python getting_laion_embeds.py --gpu <GPU_ID> --b <BATCH_SIZE> --m <MODEL> --p <PART>`
   - Example: `python getting_laion_embeds.py --gpu 0 --b 4096 --m clip --p 0`
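   - Computes embeddings for LAION dataset samples
   - Flags: `--gpu` is the GPU ID (default: 0), `--b` is the batch size (default: 10000), `--m` is the embedding model (`clip` or `allroberta`), and `--p` is the part of the LAION dataset to process
   - Output: Embedding files saved in the specified LAION location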

3. **Calculate Scores (scores_new.py)**
   - Usage: `python scores_new.py --gpu <GPU_ID> --b <BATCH_SIZE> --p <PART>`
   - Example: `python scores_new.py --gpu 0 --b 4096 --p 0`
   - Calculates cosine similarity scores between LAION embeddings and ImageNet classes
   - Use the same batch size (`--b`) as in the embedding step
   - Output: Score files saved in the specified LAION location

4. **Sort Samples (sort_samples.py)**
   - Usage: `python sort_samples.py --p <PART> --max <MAX_SAMPLES> --b <BATCH_SIZE> --sort_b <SORT_BATCH_SIZE> --gpu <GPU_ID>`
   - Example: `python sort_samples.py --p 0 --max 50000 --b 4096 --sort_b 655536 --gpu 0`
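   - Sorts the samples for each class, accumulating results across parts; run the parts one at a time, in sequence
   - `--max` is the maximum number of samples to keep per class and `--sort_b` is the batch size used for sorting; `--b` must match the batch size used in the scoring step
   - Output: Sorted scores and sample IDs saved in the results directory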

5. **Collect and Deduplicate (collect_fast.py)**
   - Usage: `python collect_fast.py --parts <NUM_PARTS> --max <MAX_SAMPLES>`
   - Example: `python collect_fast.py --parts 1 --max 20000`
   - Collects the best samples from all parts and removes duplicates
   - Output: A final parquet file containing the best, deduplicated samples with their metadata, assigned concept, and score
   - `collection_2754_classes_2000_samples_mean_tempselect_1000samples__2754classes_corrected_1000samples_dummy.snappy.parquet` is a dummy parquet file containing 1000 samples from our collection
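
To make steps 3 to 5 concrete, here is a minimal pure-Python sketch of the logic on toy data: cosine scores against class prototypes, a per-class top-k list that is updated part by part, and a final pass that keeps each sample only once. It is an illustration under assumptions, not the code in `scores_new.py`, `sort_samples.py`, or `collect_fast.py`. All function names are hypothetical, and the deduplication rule shown (keep a sample only for its best-scoring class) is an assumed interpretation of "removes duplicates".

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(sample_embeds, class_protos):
    """Return scores[i][c]: cosine similarity of sample i and class prototype c."""
    samples = [normalize(v) for v in sample_embeds]
    protos = [normalize(v) for v in class_protos]
    return [[sum(a * b for a, b in zip(s, p)) for p in protos] for s in samples]


def merge_top_k(previous, sample_ids, scores, k):
    """Fold one part into the running per-class top-k lists.

    previous maps class_index -> [(score, sample_id), ...] from earlier parts.
    """
    merged = {}
    for c in range(len(scores[0])):
        candidates = list(previous.get(c, []))
        candidates += [(row[c], sid) for sid, row in zip(sample_ids, scores)]
        merged[c] = heapq.nlargest(k, candidates)
    return merged


def collect_deduplicated(top_k, k):
    """Keep each sample only for its best-scoring class, then keep k per class."""
    best = {}  # sample_id -> (score, class_index)
    for c, entries in top_k.items():
        for score, sid in entries:
            if sid not in best or score > best[sid][0]:
                best[sid] = (score, c)
    per_class = {}
    for sid, (score, c) in best.items():
        per_class.setdefault(c, []).append((score, sid))
    return {c: heapq.nlargest(k, entries) for c, entries in per_class.items()}


# Toy data: two class prototypes and two "parts" of sample embeddings.
class_protos = [[1.0, 0.0], [0.0, 1.0]]
parts = [
    (["a", "b", "c"], [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]),
    (["d"], [[5.0, 1.0]]),
]

running = {}
for sample_ids, embeds in parts:  # parts are processed in order
    scores = cosine_scores(embeds, class_protos)
    running = merge_top_k(running, sample_ids, scores, k=3)

final = collect_deduplicated(running, k=2)
for c, entries in sorted(final.items()):
    print(c, [sid for _, sid in entries])
```

In this toy run, samples `b` and `d` appear in the candidate lists of both classes and are each kept only for class 0, where they score higher. The final cut to two samples per class then drops `b`. The looser limit during sorting and the tighter one during collection mirror the `--max 50000` and `--max 20000` values in the examples above.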

### Important Note
Run each step on every parquet part of the uncurated set before starting the next step: compute embeddings for all parts, then scores for all parts, then sort the parts one by one in order, and only then collect.
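
The ordering in the note above can be written as a small driver that builds the commands step by step. This is a sketch: `build_commands` is a hypothetical helper, the default values are copied from the examples above, and the zero-based part numbering (`--p 0` to `--p <NUM_PARTS - 1>`) is an assumption.

```python
def build_commands(num_parts, gpu=0, batch_size=4096, model="clip",
                   max_sort=50000, sort_batch=655536, max_collect=20000):
    """Return the pipeline commands, finishing each step for all parts first."""
    gpu_s, b_s = str(gpu), str(batch_size)
    commands = []
    # Embeddings for every part.
    for part in range(num_parts):
        commands.append(["python", "getting_laion_embeds.py", "--gpu", gpu_s,
                         "--b", b_s, "--m", model, "--p", str(part)])
    # Scores for every part, with the same batch size as above.
    for part in range(num_parts):
        commands.append(["python", "scores_new.py", "--gpu", gpu_s,
                         "--b", b_s, "--p", str(part)])
    # Cumulative sorting, one part after another in order.
    for part in range(num_parts):
        commands.append(["python", "sort_samples.py", "--p", str(part),
                         "--max", str(max_sort), "--b", b_s,
                         "--sort_b", str(sort_batch), "--gpu", gpu_s])
    # Final collection and deduplication over all parts.
    commands.append(["python", "collect_fast.py", "--parts", str(num_parts),
                     "--max", str(max_collect)])
    return commands


commands = build_commands(num_parts=2)
for cmd in commands:
    print(" ".join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Running the commands strictly in list order keeps the sorting step sequential across parts, as required.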








<!-- ignore below -->






<!-- 
Training code in /train
Data Collection code in /collection

Training Code Instructions:

Adapted from LiLT codebase https://github.com/codezakh/LilT

Download the datasets cc3m, cc12m, sbu and laion_class_collected_6m, and imagenet valset into ./image_data using img2dataset https://github.com/rom1504/img2dataset

Setup the paths in dinov2-arl-wds-combined.yaml file

Setup environment using environment.yaml file followed by bash extra_install.sh

Use config dinov2-arl-wds-combined to run the following training script; 
python -m torch.distributed.launch --master_port=43770 --nproc_per_node=8 --use_env PretrainHydra.py --config dinov2-arl-wds-combined  --output_dir ./storage/lilt_cache/dinov2-arl-local-vis-patch-local-text-patch-text-mlp-textpool-mean-imageprotos-cc3m-cc12m-sbu-16k_bs/ --overrides +save_last_only=False  fp16=True disable_wandb=False text_pooling=mean local_vision_projection=patch local_text_projection=patch text_projection=mlp


Training takes 50 hours on 8 A100 GPUs

Collection Code Instructions:

0. Download all laion parquets to /laion400m-meta
1. getting_laion_embeds.py for getting all the embeddings
2. scores_new.py for calculating cosine similarity scores of all caption embeddings with the concept prototypes.
3. sort_samples.py to sort all the samples for each class. This is done cumulatively per part.
4. collect_fast.py to deduplicate everything

1. getting_laion_embeds.py
This script computes embeddings for the LAION dataset samples.
Usage:
python getting_laion_embeds.py --gpu <GPU_ID> --b <BATCH_SIZE> --m <MODEL> --p <PART>

Ex.
python getting_laion_embeds.py --gpu 0 --b 4096 --m clip --p 0

--gpu: GPU ID to use (default: 0)
--b: Batch size for processing (default: 10000)
--m: Model to use for embedding ("clip" or "allroberta")
--p: Part of the LAION dataset to process

Output: Embedding files saved in the specified LAION location.

2. scores_new.py
This script calculates cosine similarity scores between LAION embeddings and ImageNet classes.
Usage:
python scores_new.py --gpu <GPU_ID> --b <BATCH_SIZE> --p <PART>

Ex.
python scores_new.py --gpu 0 --b 4096 --p 0

--gpu: GPU ID to use
--b: Batch size for processing
--p: Part of the LAION dataset to process

Note: make sure the batch size is the same as the previous step
Output: Score files saved in the specified LAION location.


3. sort_samples.py
This script sorts the samples for each class cumulatively per part.
Usage:
python sort_samples.py --p <PART> --max <MAX_SAMPLES> --b <BATCH_SIZE> --sort_b <SORT_BATCH_SIZE> --gpu <GPU_ID>

Ex.
python sort_samples.py --p 0 --max 50000 --b 4096 --sort_b 655536 --gpu 0



--p: Part of the LAION dataset to process
--max: Maximum number of samples to keep per class
--b: Batch size for processing
--sort_b: Batch size for sorting
--gpu: GPU ID to use

Note: make sure the batch size is the same as the batch_size in the previous step
NOTE: run this function part by part in sequence
Output: Sorted scores and sample IDs saved in the results directory.

4. collect_fast.py
This script collects the best samples from all parts and removes duplicates.
Usage:
python collect_fast.py --parts <NUM_PARTS> --max <MAX_SAMPLES>

Ex. 
python collect_fast.py --parts 1 --max 20000

--parts: Number of parts to process
--max: Maximum number of samples to keep per class
--per_class 2000


Output: A final parquet file containing the best, deduplicated samples with their metadata, assigned ImageNet class, and score.


Important note: Do step 1 for all the parquet parts in the uncurated set, then step 2 for all the parts, then step 3 for all the parts, and so on.





 -->
