# Reproducing Clustering → Inference → Bounds (Segmentation on COCO)

This repository contains three scripts used to reproduce the experimental pipeline for semantic segmentation models:

- `cluster_coco.py`: clusters the COCO images into K groups by assigning each image to one of K random centroids using FAISS.
- `infer_coco.py`: runs a torchvision segmentation model on COCO and saves per-image metrics and masks.
- `compute_bound.py`: computes the proposed generalization bounds across seeds and clusters.

The required order is:
1) **Clustering**, 2) **Inference**, then 3) **Bounds**.

---

## 1) Environment & Dependencies

Tested with Python ≥ 3.9. Install the dependencies below; the PyTorch command targets CUDA 12.1 (`cu121`), so change the index URL to match your GPU/CUDA setup:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install faiss-cpu pillow tqdm numpy pycocotools
```

> If you have a CUDA-capable GPU, you can install `faiss-gpu` or `faiss-cudaXXX` instead of `faiss-cpu`.

### Data prerequisites (COCO 2017)
- **Images**: `train2017` and/or `val2017`, each containing `.jpg` files named `000000XXXXXX.jpg`.
- **Annotations**: JSON annotation files from COCO, e.g.:
  - `annotations/instances_train2017.json`
  - `annotations/instances_val2017.json`

Example structure:
```
/path/to/coco/
  train2017/
    000000000009.jpg
    ...
  val2017/
    000000000139.jpg
    ...
  annotations/
    instances_train2017.json
    instances_val2017.json
```

---

## 2) Clustering

`cluster_coco.py` resizes each image, flattens it into a vector, and assigns it to one of K randomly initialized centroids. Run it once per split:

```bash
python cluster_coco.py \
  --data_dir /path/to/coco/train2017 \
  --split train \
  --K 75 \
  --seed 42 \
  --output_dir groupings
```

```bash
python cluster_coco.py \
  --data_dir /path/to/coco/val2017 \
  --split val \
  --K 75 \
  --seed 42 \
  --output_dir groupings
```
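
The assignment step described above can be sketched in a few lines. This is a minimal illustration, not the code in `cluster_coco.py`: it uses a brute-force NumPy nearest-centroid search where the script uses FAISS, it assumes the centroids are K rows sampled from the flattened image vectors, and `assign_to_random_centroids` is a hypothetical helper name.

```python
import numpy as np


def assign_to_random_centroids(features, k, seed):
    """Assign each row of `features` to its nearest randomly chosen centroid.

    features: (n_images, n_dims) array of flattened, resized images.
    Returns an (n_images,) array of cluster indices in [0, k).
    """
    rng = np.random.default_rng(seed)
    # Assumption: centroids are k distinct rows sampled from the data itself.
    centroid_rows = rng.choice(features.shape[0], size=k, replace=False)
    centroids = features[centroid_rows]
    # Squared L2 distances via ||x||^2 - 2 x.c + ||c||^2, shape (n_images, k).
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)


# Stand-in for flattened 8x8 RGB thumbnails of 200 images.
demo_features = np.random.default_rng(0).random((200, 8 * 8 * 3), dtype=np.float32)
labels_a = assign_to_random_centroids(demo_features, k=5, seed=42)
labels_b = assign_to_random_centroids(demo_features, k=5, seed=42)
```

Because the only randomness comes from the seeded generator, repeating the call with the same seed yields the same assignment, which is the property the per-seed grouping files rely on.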

**Outputs (in `--output_dir`):**
- `grouping_K_{K}_seed_{seed}.json`  
  A JSON object with one key per cluster index; each cluster has `"train"` and `"val"` subkeys listing the COCO image IDs assigned to it, e.g.:

```json
{
  "0": { "train": [100001, 100002], "val": [200001] },
  "1": { "train": [100010], "val": [] }
}
```

> Running the train and val commands in sequence with the same `--K`, `--seed`, and `--output_dir` merges both splits into a single file for that seed.
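
How such a per-split merge might work is sketched below. This is an assumption about the behavior, not the script's actual code: `merge_split_into_grouping` is a hypothetical helper, and the rule that re-running a split replaces only that split's IDs is a guess.

```python
import json
import os
import tempfile


def merge_split_into_grouping(path, split, assignments):
    """Merge {cluster_index: [image_ids]} for one split into the grouping JSON at `path`."""
    grouping = {}
    if os.path.exists(path):
        with open(path) as f:
            grouping = json.load(f)
    for cluster, image_ids in assignments.items():
        entry = grouping.setdefault(str(cluster), {"train": [], "val": []})
        # Assumption: re-running a split overwrites that split's IDs only.
        entry[split] = sorted(image_ids)
    with open(path, "w") as f:
        json.dump(grouping, f, indent=2)
    return grouping


# Usage with the IDs from the example above, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "grouping_K_2_seed_42.json")
merge_split_into_grouping(demo_path, "train", {0: [100001, 100002], 1: [100010]})
merged = merge_split_into_grouping(demo_path, "val", {0: [200001]})
```

After the second call, `merged` has the same layout as the JSON example: cluster `"0"` holds both train and val IDs, and cluster `"1"` keeps an empty `"val"` list.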

---

## 3) Inference on COCO

Runs a torchvision segmentation model on one COCO split and computes per-image metrics (mIoU, pixel accuracy, per-class IoU, Dice loss).

Supported models:
- `deeplabv3_resnet101`
- `deeplabv3_resnet50`
- `deeplabv3_mobilenet_v3_large`
- `fcn_resnet101`
- `fcn_resnet50`

### Example: Validation set
```bash
python infer_coco.py \
  --split val \
  --model deeplabv3_resnet101 \
  --image_root_dir /path/to/coco/val2017 \
  --annotation_file_path /path/to/coco/annotations/instances_val2017.json
```

### Example: Training set
```bash
python infer_coco.py \
  --split train \
  --model deeplabv3_resnet101 \
  --image_root_dir /path/to/coco/train2017 \
  --annotation_file_path /path/to/coco/annotations/instances_train2017.json
```

**Outputs:**
- `outputs/{model}/train_inference.pth`
- `outputs/{model}/val_inference.pth`

Each file contains:
- `image_ids` (list of COCO integer IDs)
- `mean_ious` (per-image mIoU values)
- `per_class_ious` (IoU vector per image/class)
- `pixelwise_accs` (per-image pixel accuracy)
- `rle_pred_masks` and `rle_true_masks` (COCO-compatible RLE encodings of masks)
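
For reference, per-image pixel accuracy and mIoU are commonly computed as shown below. This is a minimal NumPy sketch of the standard definitions, not the code in `infer_coco.py`: `per_image_metrics` is a hypothetical helper, and averaging IoU only over classes that appear in the prediction or the ground truth is an assumption (the script may treat absent classes or ignore labels differently).

```python
import numpy as np


def per_image_metrics(pred, true, num_classes):
    """Return (pixel_accuracy, per_class_iou, mean_iou) for two integer label maps."""
    pixel_acc = float((pred == true).mean())
    per_class_iou = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c, true_c = pred == c, true == c
        union = np.logical_or(pred_c, true_c).sum()
        if union > 0:  # classes absent from both maps stay NaN
            per_class_iou[c] = np.logical_and(pred_c, true_c).sum() / union
    mean_iou = float(np.nanmean(per_class_iou))
    return pixel_acc, per_class_iou, mean_iou


# Tiny 2x3 example with three classes; class 2 appears in neither map.
true = np.array([[0, 0, 1], [1, 1, 1]])
pred = np.array([[0, 1, 1], [1, 1, 1]])
acc, ious, miou = per_image_metrics(pred, true, num_classes=3)
```

In this example one of six pixels is wrong, so the pixel accuracy is 5/6; the IoUs are 0.5 for class 0 and 0.8 for class 1, giving an mIoU of 0.65.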

---

## 4) Bound computation

After clustering and inference, compute the bounds across multiple seeds.

```bash
python compute_bound.py \
  --inference_dir outputs \
  --group_dir groupings \
  --out_dir bounds \
  --K 75
```

### What it does
- Reads all grouping JSONs: `grouping_K_75_seed_{s}.json`
- Loads inference losses:  
  - `outputs/{model}/train_inference.pth`  
  - `outputs/{model}/val_inference.pth`
- Splits validation images deterministically into `val_1k` and `test_4k`.
- Computes TD25 `(5)` and OLD `(3)` bounds across seeds.
- Writes per-model CSVs:
  - `bounds/{model}/metrics.csv`
  - `bounds/{model}/bound_train_miou.csv`
  - `bounds/{model}/bound_val_miou.csv`
  - `bounds/{model}/bound_train_miou_old.csv`
  - `bounds/{model}/bound_val_miou_old.csv`
- Aggregates across models into:
  - `aggregate_train_summary.csv`
  - `aggregate_val_summary.csv`
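
A deterministic split of this kind can be sketched as follows. The actual rule in `compute_bound.py` is not documented here, so the fixed shuffle seed, the sort-then-shuffle order, and the helper name `split_val_ids` are all assumptions; the 1,000 / 4,000 sizes are only what the names `val_1k` and `test_4k` suggest.

```python
import random


def split_val_ids(image_ids, n_val=1000, seed=0):
    """Split image IDs into (val_ids, test_ids), independent of input order."""
    ids = sorted(set(image_ids))      # canonical order first
    random.Random(seed).shuffle(ids)  # fixed seed: same permutation on every run
    return ids[:n_val], ids[n_val:]


# Stand-in for the 5,000 COCO val2017 image IDs.
demo_ids = list(range(5000))
val_1k, test_4k = split_val_ids(demo_ids)
val_again, _ = split_val_ids(list(reversed(demo_ids)))
```

Sorting before shuffling means the result depends only on the set of IDs and the seed, not on the order in which the inference file happens to list them.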

---

## 5) Expected Workflow

1. **Cluster** both train + val for multiple seeds:
   ```bash
   for s in 50 65 70 83 100; do
     python cluster_coco.py --data_dir /coco/train2017 --split train --K 75 --seed "$s" --output_dir groupings
     python cluster_coco.py --data_dir /coco/val2017   --split val   --K 75 --seed "$s" --output_dir groupings
   done
   ```

2. **Inference** for each model × split:
   ```bash
   for model in deeplabv3_resnet101 fcn_resnet50; do
     python infer_coco.py --split train --model "$model" \
       --image_root_dir /coco/train2017 --annotation_file_path /coco/annotations/instances_train2017.json
     python infer_coco.py --split val --model "$model" \
       --image_root_dir /coco/val2017 --annotation_file_path /coco/annotations/instances_val2017.json
   done
   ```

3. **Compute bounds**:
   ```bash
   python compute_bound.py \
     --inference_dir outputs \
     --group_dir groupings \
     --out_dir bounds \
     --K 75
   ```

---

## 6) Notes
- `cluster_coco.py` uses random initialization of centroids. Use consistent seeds for reproducibility.
- Grouping JSONs are cumulative: running `train` and then `val` with the same seed merges both splits.
- `compute_bound.py` expects inference results and groupings to match in image IDs (COCO integer IDs).
- Metrics are written as per-model CSV files and also aggregated across models for convenience.
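
A quick consistency check before running `compute_bound.py` might look like the sketch below. It assumes the grouping layout shown in section 2 and takes the inference IDs as a plain list; `find_missing_ids` is a hypothetical helper, not part of the repository.

```python
def find_missing_ids(grouping, split, inference_ids):
    """Return grouped image IDs of `split` that have no inference result."""
    available = set(inference_ids)
    grouped = {i for cluster in grouping.values() for i in cluster.get(split, [])}
    return sorted(grouped - available)


demo_grouping = {
    "0": {"train": [100001, 100002], "val": [200001]},
    "1": {"train": [100010], "val": []},
}
# Pretend inference covered only two of the three train images.
missing_train = find_missing_ids(demo_grouping, "train", [100001, 100010])
missing_val = find_missing_ids(demo_grouping, "val", [200001])
```

In practice, `inference_ids` would be the `image_ids` list stored in the matching `*_inference.pth` file, and an empty result for both splits indicates that every grouped image has a metric to look up.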
