# Reproducing Clustering → Inference → Bounds & gap(O)/gap(K)

This repository contains five scripts used to reproduce the experimental pipeline for our paper:

- `cluster_imagenet_train.py` — cluster the ImageNet **train** images (in splits).
- `cluster_imagenet_val10k_test40k.py` — cluster the ImageNet **val** images.
- `infer_imagenet_.py` — run model inference on ImageNet datasets and save logits/probabilities/errors.
- `compute_bound.py` — compute the proposed bounds and write results to an Excel file.
- `compute_gapK_gapO.py` — compute the gap(O) and gap(K) metrics and write CSV summaries.

The required order is:
1) **Clustering** (validation then training), 2) **Inference**, then 3) **Bounds** and **gapO/gapK**.

---

## 1) Environment & Dependencies

Tested with Python ≥ 3.9. Install the dependencies, choosing the PyTorch build that matches your GPU/CUDA setup:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121   # choose your CUDA/CPU index-url
pip install faiss-cpu pillow tqdm numpy pandas openpyxl
```
> To run FAISS on the GPU, install `faiss-gpu`/`faiss-cudaXXX` instead of `faiss-cpu`, matching your CUDA version.

### Data prerequisites (ImageNet ILSVRC2012)
- **Validation images** (50k): files named `ILSVRC2012_val_XXXXXXXX.JPEG` in `VAL_DIR`.
- **Validation XML annotations**: files named `ILSVRC2012_val_XXXXXXXX.xml` in `VAL_XML_DIR`.
- **Training set**: 1k class subfolders (e.g., `n01440764/…`) in `TRAIN_DIR`.

The scripts are easiest to run with the following directory layout:
```
/path/to/imagenet/
  train/
    n01440764/
      *.JPEG
    ...
  val/
    ILSVRC2012_val_00000001.JPEG
    ...
  val_xml/
    ILSVRC2012_val_00000001.xml
    ...
```
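
Before starting the clustering step, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `check_imagenet_layout` is hypothetical, and the expected counts (1,000 class folders, 50,000 validation images) are taken from the prerequisites listed above.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of human-readable problems found under `root` (empty if none)."""
    root = Path(root)
    problems = []

    # train/ should hold one subfolder per class (1,000 for ILSVRC2012).
    train_dir = root / "train"
    class_dirs = [p for p in train_dir.iterdir() if p.is_dir()] if train_dir.is_dir() else []
    if len(class_dirs) != 1000:
        problems.append(f"train/: expected 1000 class folders, found {len(class_dirs)}")

    # val/ and val_xml/ should contain files with matching stems.
    val_dir, xml_dir = root / "val", root / "val_xml"
    val_stems = {p.stem for p in val_dir.glob("ILSVRC2012_val_*.JPEG")} if val_dir.is_dir() else set()
    xml_stems = {p.stem for p in xml_dir.glob("ILSVRC2012_val_*.xml")} if xml_dir.is_dir() else set()
    if len(val_stems) != 50000:
        problems.append(f"val/: expected 50000 images, found {len(val_stems)}")
    unannotated = val_stems - xml_stems
    if unannotated:
        problems.append(f"val_xml/: {len(unannotated)} validation images have no XML annotation")

    return problems


if __name__ == "__main__":
    for line in check_imagenet_layout("/path/to/imagenet") or ["layout looks fine"]:
        print(line)
```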

---

## 2) Clustering

### 2.1 Cluster validation images
Produces `val10k_grouping_K{K}_seed{S}.json`, mapping cluster → list of `"<class>_<val_id>"`.
```bash
python cluster_imagenet_val10k_test40k.py   --val_dir /path/to/imagenet/val   --val_xml_dir /path/to/imagenet/val_xml   --output_dir /path/to/out/clusterings   --K 200   --seed 70   --batch_size 64
```
**Outputs (in `--output_dir`):**
- `val10k_grouping_K200_seed70.json`
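
To sanity-check a grouping file, a short sketch like the one below can report how many items and how many distinct classes each cluster contains. It assumes the JSON is an object whose keys are cluster IDs and whose values are lists of `"<class>_<val_id>"` strings, and that the class label is everything before the first underscore; `summarize_grouping` is a hypothetical helper, not a function from the scripts.

```python
import json
from collections import Counter


def summarize_grouping(path):
    """Return {cluster_id: (n_items, n_distinct_classes)} for one grouping file."""
    with open(path, "r", encoding="utf-8") as f:
        # Assumed shape: {"<cluster>": ["<class>_<val_id>", ...], ...}
        grouping = json.load(f)

    summary = {}
    for cluster_id, members in grouping.items():
        # Assumption: the class label is the text before the first underscore.
        classes = Counter(member.partition("_")[0] for member in members)
        summary[cluster_id] = (len(members), len(classes))
    return summary


if __name__ == "__main__":
    path = "/path/to/out/clusterings/val10k_grouping_K200_seed70.json"
    for cluster_id, (n_items, n_classes) in summarize_grouping(path).items():
        print(f"cluster {cluster_id}: {n_items} items, {n_classes} classes")
```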

### 2.2 Cluster training images (in splits)
`cluster_imagenet_train.py` processes the ImageNet train set in *splits* so that it fits in memory. Run it **once per split**; the script merges the per-split results internally into the saved JSON.
```bash
# Run for each split index (e.g., 0,1,2,3). If the script accepts 0..4, run all 5.
for i in 0 1 2 3; do
  python cluster_imagenet_train.py     --train_dir /path/to/imagenet/train     --val_dir /path/to/imagenet/val     --val_xml_dir /path/to/imagenet/val_xml     --output_dir /path/to/out/clusterings     --split_index $i     --K 200     --seed 70
done
```
**Outputs (in `--output_dir`):**
- `train_group_K200_seed70.json`
- `val_grouping_K200_seed70.json` (created as a convenience by the script)

> Notes:
> - `--K` is the number of clusters.
> - `--seed` controls the centroid initialization and batching order.
> - `--split_index` is the chunk of the train set to process (commonly 0–3).

---

## 3) Inference on ImageNet

Run a torchvision model (e.g., `resnet50`) over the ImageNet datasets and save probabilities and per-image names:
```bash
python infer_imagenet_.py   --model resnet50   --data /path/to/imagenet   --split val   # or: --split train
```
**Output (in the working directory):**
- `{model}_imagenet_{split}.pth`, where `{split}` is `train` or `val` (e.g., `resnet50_imagenet_val.pth`), containing:
  - `probs` (tensor/ndarray of class probabilities)
  - `img_names` (list of image basenames for the chosen split)
  - `accuracies` (per-batch or running accuracy stats)
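
As an illustration of how the saved output might be consumed, the sketch below maps each image name to its top-1 predicted class index. It assumes the `.pth` file was written with `torch.save` as a dictionary with the keys listed above; both helper names are hypothetical.

```python
def top1_predictions(probs, img_names):
    """Map each image name to the index of its highest-probability class."""
    # Torch tensors and NumPy arrays both expose .tolist(); plain nested lists pass through.
    rows = probs.tolist() if hasattr(probs, "tolist") else probs
    if len(rows) != len(img_names):
        raise ValueError(f"{len(rows)} probability rows but {len(img_names)} image names")
    return {
        name: max(range(len(row)), key=row.__getitem__)
        for name, row in zip(img_names, rows)
    }


def load_inference_output(path):
    """Load a saved output file (assumed: a dict written with torch.save)."""
    import torch  # imported lazily so top1_predictions works without PyTorch

    # If the file stores NumPy arrays, recent PyTorch versions may need weights_only=False.
    return torch.load(path, map_location="cpu")


# Example usage (the path is a placeholder):
# out = load_inference_output("resnet50_imagenet_val.pth")
# preds = top1_predictions(out["probs"], out["img_names"])
```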

---

## 4) Bound computation

Compute the bound statistics for either multiple **seeds** at a fixed `K` or multiple **K** at a fixed seed.

Common arguments:
- `--mode {seeds|byK}`
- `--model_name` (tag used for output rows)
- `--version` (integer indicating which model output version to use; default: 1)
- `--delta` (confidence parameter; default: 0.01)
- `--K` (for `seeds`: a single integer; for `byK`: a comma-separated list, e.g., `50,100,200`)
- `--seeds` (comma list, e.g., `50,65,70,83,100`)
- `--group_json_tpl` **template** for the clustering files. Use `{K}` and `{seed}` placeholders, e.g.:
  - `/path/to/out/clusterings/val_grouping_K{K}_seed{seed}.json`
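
The template is expanded once per `(K, seed)` combination. The sketch below shows one way to preview the resulting paths and find any clustering files that are still missing before launching a run. It assumes the placeholders follow Python `str.format` semantics, which this README does not confirm; the helper names are hypothetical.

```python
from pathlib import Path


def expand_group_paths(template, ks, seeds):
    """Return {(K, seed): path} for every combination of K and seed."""
    # Assumption: {K} and {seed} are filled in str.format style.
    return {(k, s): template.format(K=k, seed=s) for k in ks for s in seeds}


def missing_group_files(template, ks, seeds):
    """List the expanded paths that do not exist on disk."""
    paths = expand_group_paths(template, ks, seeds).values()
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    tpl = "/path/to/out/clusterings/val_grouping_K{K}_seed{seed}.json"
    for path in missing_group_files(tpl, ks=[200], seeds=[50, 65, 70, 83, 100]):
        print("missing:", path)
```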

### Example A — fixed K, vary seeds
```bash
python compute_bound.py   --mode seeds   --model_name resnet50   --K 200   --seeds 50,65,70,83,100   --version 1   --delta 0.01   --group_json_tpl /path/to/out/clusterings/val_grouping_K{K}_seed{seed}.json
```
### Example B — fixed seed, vary K
```bash
python compute_bound.py   --mode byK   --model_name resnet50   --K 50,100,200   --seeds 70   --version 1   --delta 0.01   --group_json_tpl /path/to/out/clusterings/val_grouping_K{K}_seed{seed}.json
```
**Output:**
- An Excel workbook `bound5.xlsx` in the working directory, with one row per run and one sheet per mode. The script creates the file if it is missing and appends to it otherwise.

---

## 5) gap(O) and gap(K)

`compute_gapK_gapO.py` computes both gap metrics using the same clustering JSONs and writes CSV summaries.

Common arguments:
- `--mode {seeds|byK}`
- `--model_name` (tag used in rows)
- `--version` (model outputs version; integer)
- `--delta` (confidence parameter)
- `--K` (single int for `seeds`; comma list for `byK`)
- `--seeds` (comma list)
- `--train_group_json_tpl` template for *training* clusters, e.g.:
  - `/path/to/out/clusterings/train_group_K{K}_seed{seed}.json`
  - the script also reads the corresponding val grouping (same template without `train_` if needed).

### Example A — fixed K, vary seeds
```bash
python compute_gapK_gapO.py   --mode seeds   --model_name resnet50   --K 200   --seeds 50,65,70,83,100   --version 1   --delta 0.01   --train_group_json_tpl /path/to/out/clusterings/train_group_K{K}_seed{seed}.json
```
**Output (working directory):**
- `gapOandK.csv` (appended to if it already exists).

### Example B — fixed seed, vary K
```bash
python compute_gapK_gapO.py   --mode byK   --model_name resnet50   --K 50,100,200   --seeds 70   --version 1   --delta 0.01   --train_group_json_tpl /path/to/out/clusterings/train_group_K{K}_seed{seed}.json
```
**Output (working directory):**
- `gapOandK_byK_result.csv` (appended to if it already exists).

---

