# Caption dataset generation

This README outlines the pipeline for building the caption dataset: generating synthetic captions, matching them against the concept bank by substring, applying balanced sampling to the generated captions, and generating hard negative captions from the resulting dataset.

### Step 1: Generate Captions

First, install vLLM:

```bash
pip install -U vllm
```

Once installed, open `.local/lib/python3.11/site-packages/vllm/transformers_utils/tokenizers/mistral.py` (the exact path depends on your Python version and environment). In the `find_tokenizer_file` method, change the regex pattern for the tokenizer file from:
```
r"^tokenizer\.model\.v.*$|^tekken\.json$|^tokenizer\.mm\.model\.v.*$"
```
to:
```
r"^tokenizer\.model$|^tokenizer\.model\.v.*$|^tokenizer\.mm\.model\.v.*$|^tekken\.json$"
```
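The practical effect of this change is that the pattern also accepts a file named exactly `tokenizer.model`, which the original pattern rejects. The check below is a minimal sketch using only Python's `re` module; the two patterns are copied from the snippets above, and the candidate file names are illustrative assumptions, not a listing from any particular model repository.

```python
import re

# Patterns copied verbatim from the "from" and "to" snippets above.
OLD_PATTERN = re.compile(
    r"^tokenizer\.model\.v.*$|^tekken\.json$|^tokenizer\.mm\.model\.v.*$"
)
NEW_PATTERN = re.compile(
    r"^tokenizer\.model$|^tokenizer\.model\.v.*$|^tokenizer\.mm\.model\.v.*$|^tekken\.json$"
)

# Illustrative file names only (assumed for this example).
CANDIDATES = ["tokenizer.model", "tokenizer.model.v3", "tekken.json", "tokenizer.json"]


def matching_files(pattern, names):
    """Return the names accepted by the given compiled pattern."""
    return [name for name in names if pattern.match(name)]


old_matches = matching_files(OLD_PATTERN, CANDIDATES)
new_matches = matching_files(NEW_PATTERN, CANDIDATES)

print("old pattern accepts:", old_matches)
print("new pattern accepts:", new_matches)
```

Only the unversioned `tokenizer.model` differs between the two result lists; every name the old pattern accepted is still accepted.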

Then generate captions with the `scaled_cap_generation.py` script. This creates a set of synthetic captions from the concept bank JSON file for a given attribute.

```bash
HF_TOKEN="..." python scaled_cap_generation.py --llm-name mistral02 \
                             --save-folder path/to/save/folder \
                             --dp-size num_gpus \
                             --concept-path path/to/concept_bank.json \
                             --attribute color
```

### Step 2: Substring matching

Before balanced sampling, run `substring_matching.py`. For each caption, it saves the list of concept bank entries that match the caption.

```bash
python substring_matching.py --num_processes <num_processes> \
                             --synthetic_captions_folder <path to folder with json generated in step 1> \
                             --captions_with_count_folder <path to save folder> \
                             --metadata_filepath path/to/concept_bank.json
```
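To make the idea of this step concrete, here is a minimal sketch of substring matching against a concept bank. It is not the code in `substring_matching.py`: the function `match_concepts`, the example concepts, and the example captions are all hypothetical, and the real script may normalize text or store its results differently.

```python
from collections import Counter


def match_concepts(caption, concepts):
    """Return the indices of the concepts that occur in the caption.

    Matching is a plain case-insensitive substring test (an assumption made
    for this sketch). Note that it also matches inside longer words.
    """
    text = caption.lower()
    return [i for i, concept in enumerate(concepts) if concept.lower() in text]


# Hypothetical concept bank and captions, for illustration only.
concepts = ["dog", "red car", "tree"]
captions = [
    "A red car parked under a tree",
    "A dog chasing a red car",
    "An empty road",
]

# One list of matched concept indices per caption.
matches = [match_concepts(caption, concepts) for caption in captions]

# Number of captions matching each concept; balanced sampling needs these counts.
counts = Counter(index for matched in matches for index in matched)

print(matches)
print(dict(counts))
```

The per-concept counts are what make the next step possible: frequent concepts can be down-weighted only once it is known how many captions match each one.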

### Step 3: Balanced sampling

`balanced_sampling.py` samples a subset of the captions with the MetaCLIP balanced sampling algorithm. It saves the sampled captions together with their concepts and the starting indices of the attributes used.

```bash
python balanced_sampling.py --num_processes <num_processes> \
                            --captions_with_count_folder <path to folder with json generated in step 2> \
                            --balanced_captions_folder <path to save folder> \
                            --metadata_filepath path/to/concept_bank.json \
                            --t 20000
```

Increasing `--t` raises the sampling probability and therefore returns more captions; decreasing it lowers the probability and returns fewer.
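The role of `--t` can be illustrated with a short sketch of the balancing rule as described for MetaCLIP: a concept matched by at most `t` captions is always sampled, while a more frequent concept is sampled with probability `t / count`. A caption is kept if at least one of its matched concepts is sampled. The function names and the toy numbers below are assumptions for illustration; `balanced_sampling.py` may differ in its details.

```python
import random


def entry_probabilities(counts, t):
    """Per-concept sampling probability: 1.0 up to t matches, t / count beyond."""
    return [t / max(count, t) for count in counts]


def balanced_sample(matches, counts, t, seed=0):
    """Return the indices of the captions kept by the balancing rule.

    matches[i] lists the concept indices matched by caption i.
    counts[j] is the number of captions matching concept j.
    """
    rng = random.Random(seed)
    probs = entry_probabilities(counts, t)
    kept = []
    for caption_index, concept_ids in enumerate(matches):
        # Keep the caption as soon as one of its concepts is sampled.
        if any(rng.random() < probs[c] for c in concept_ids):
            kept.append(caption_index)
    return kept


# Toy counts (hypothetical): concept 0 is very frequent, concept 1 is rare.
counts = [1000, 10]
probs_small_t = entry_probabilities(counts, t=20)
probs_large_t = entry_probabilities(counts, t=500)
print(probs_small_t, probs_large_t)

# Captions 0 and 1 match only the rare concept; caption 2 matches nothing.
kept = balanced_sample([[1], [1], []], counts, t=20)
print(kept)
```

With these toy counts, raising `t` from 20 to 500 lifts the probability of the frequent concept from 0.02 to 0.5, while the rare concept stays at 1.0. Captions that match no concept are never kept.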

### Step 4: Hard negative generation

Finish by generating the hard negative captions and saving the CSV file, ready for image generation and training.

```bash
HF_TOKEN="..." python scaled_hn_generation.py --llm-name llama \
                             --curated-captions-folder <path to folder with json generated in step 3> \
                             --save-folder path/to/save/folder \
                             --concept-path path/to/concept_bank.json \
                             --dp-size num_gpus \
                             --batch-size-per-gpu 150000 \
                             --t <t value used in step 3>
```