### Code and ground-truth (gt) / prediction (pred) files are provided to reproduce the evaluation metrics reported in the paper.

### Note: Gemini results can be non-deterministic, so use gemma_inference.py when you need reproducible results.

### 0. Set up a virtualenv using requirements.txt

virtualenv -p python3 f4-its
source f4-its/bin/activate
pip install -r requirements.txt

### 1. Download the dataset from Hugging Face. Note that F4-ITS evaluates only a subset of the full dataset, which is enough to validate the hypothesis. Refer to the GT JSON files for the image filenames that are used.
Download the data from https://huggingface.co/datasets/jesusmolrdv/MTF25-VLM-Challenge-Dataset-Web and place it under a folder named "images/", which is used for evaluation.
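Before running the later steps, it can help to confirm that every image referenced by a GT file is actually present under "images/". The sketch below is not part of the repository: the function names are hypothetical, and because the GT JSON layout is not documented here, it simply assumes that image filenames appear somewhere in the JSON as keys or string values ending in a common image extension.

```python
import json
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".webp")


def collect_image_names(node):
    """Recursively gather strings (keys or values) that look like image filenames."""
    names = set()
    if isinstance(node, dict):
        for key, value in node.items():
            names |= collect_image_names(key)
            names |= collect_image_names(value)
    elif isinstance(node, list):
        for item in node:
            names |= collect_image_names(item)
    elif isinstance(node, str) and node.lower().endswith(IMAGE_SUFFIXES):
        # Keep only the basename so nested paths still match files on disk.
        names.add(Path(node).name)
    return names


def missing_images(gt_json_path, image_dir="images"):
    """Return the sorted filenames referenced in the GT file but absent from image_dir."""
    with open(gt_json_path, encoding="utf-8") as f:
        gt = json.load(f)
    present = {p.name for p in Path(image_dir).rglob("*") if p.is_file()}
    return sorted(collect_image_names(gt) - present)
```

For example, `missing_images("gt/top1/real_gt_dense_final.json")` should return an empty list once the download is complete, provided the assumption about the JSON layout holds.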


### 2. Generate VLM captions (dense and sparse) for all test images. This preprocessing step is required before running evaluation; it generates data for both top-1 and top-k evaluation.

python gemma_inference.py images/ "path to save dense captions.json" "path to save sparse captions.json"

### List of open_clip model variants evaluated (architecture, pretrained tag)

- ViT-B-32-256, datacomp_s34b_b86k
- ViT-L-14-CLIPA-336, datacomp1b
- ViT-H-14-CLIPA-336, datacomp1b
- ViT-g-14, laion2b_s34b_b88k
- ViT-bigG-14-CLIPA-336, datacomp1b
- ViT-H-14-378-quickgelu, dfn5b
- ViT-L-16-SigLIP2-512, webli
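Each entry pairs an open_clip architecture name with a pretrained tag. As a rough sketch of how one such pair can be loaded (the evaluation scripts may do this differently; `VARIANTS` and `load_variant` are hypothetical names introduced here), using `open_clip.create_model_and_transforms` and `open_clip.get_tokenizer`:

```python
# (architecture, pretrained tag) pairs copied from the list above.
VARIANTS = [
    ("ViT-B-32-256", "datacomp_s34b_b86k"),
    ("ViT-L-14-CLIPA-336", "datacomp1b"),
    ("ViT-H-14-CLIPA-336", "datacomp1b"),
    ("ViT-g-14", "laion2b_s34b_b88k"),
    ("ViT-bigG-14-CLIPA-336", "datacomp1b"),
    ("ViT-H-14-378-quickgelu", "dfn5b"),
    ("ViT-L-16-SigLIP2-512", "webli"),
]


def load_variant(arch, pretrained):
    """Load one model variant; downloads weights on first use."""
    # Imported lazily so VARIANTS can be inspected without open_clip installed.
    import open_clip

    # create_model_and_transforms returns (model, train_preprocess, val_preprocess);
    # only the validation-time preprocessing is needed for evaluation.
    model, _, preprocess = open_clip.create_model_and_transforms(
        arch, pretrained=pretrained
    )
    tokenizer = open_clip.get_tokenizer(arch)
    model.eval()
    return model, preprocess, tokenizer
```

The larger variants need substantial memory and download time, so it is reasonable to start with the smallest one.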


### 3. Run evaluation for top-1

python evaluate_top1.py images/ gt/top1/real_gt_dense_final.json gt/top1/real_gt_sparse_final.json pred/top1/gemma_real_13K_dense.json pred/top1/gemma_real_13k_sparse.json False False True ViT-g-14 laion2b_s34b_b88k 0.7 0.3

    - False False True -> the first two arguments control whether sparse captions are used; pass False for both by default, since we want to evaluate on the rich dense captions. The third argument (True) enables feature fusion.
    - ViT-g-14 laion2b_s34b_b88k is the open_clip architecture and pretrained tag to use.
    - 0.7 and 0.3 are the image and text weights, respectively.
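To illustrate what the image and text weights mean, here is a minimal sketch of weighted feature fusion. It assumes fusion is a weighted sum of L2-normalized image and caption embeddings followed by re-normalization; the actual logic in evaluate_top1.py may differ, and `l2_normalize` and `fuse` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse(image_emb, text_emb, image_weight=0.7, text_weight=0.3):
    """Weighted sum of unit-length image and text embeddings, re-normalized.

    Assumption: this mirrors the role of the 0.7 / 0.3 arguments; it is not
    taken from the evaluation script itself.
    """
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    fused = [image_weight * i + text_weight * t for i, t in zip(img, txt)]
    return l2_normalize(fused)
```

Under this assumption, a larger image weight pulls the fused vector toward the image embedding, so 0.7 / 0.3 favors visual evidence while still letting the caption contribute.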

### 4. Run evaluation for top-k

python evaluate_topk.py images/ gt/topk/gt_sparse_all_items.json gt/top1/real_gt_sparse_final.json pred/top1/gemma_real_13k_sparse.json True ViT-H-14-378-quickgelu dfn5b

    - True enables feature fusion; keep it as True.
    - ViT-H-14-378-quickgelu dfn5b is the open_clip architecture and pretrained tag to use.
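The exact top-k metric is defined by evaluate_topk.py and the paper. As a generic illustration only, a top-k check ranks candidates by similarity score and asks whether any ground-truth item appears among the k best. The helper names below are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def hit_at_k(scores, relevant, k):
    """True if any ground-truth index is among the k highest-scoring candidates."""
    return any(i in relevant for i in top_k_indices(scores, k))


def hit_rate_at_k(all_scores, all_relevant, k):
    """Fraction of queries with at least one ground-truth item in the top k."""
    hits = sum(hit_at_k(s, r, k) for s, r in zip(all_scores, all_relevant))
    return hits / len(all_scores)
```

Here `scores` holds one similarity value per candidate and `relevant` is the set of candidate indices marked correct in the ground truth.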