# Prior Distribution & Model Confidence

---

## Table of Contents
1. [Overview](#overview)  
2. [Models](#models)  
   - [Classification Models](#classification-models)  
   - [Embedding Models](#embedding-models)  
3. [Quickstart](#quickstart)  
4. [Config](#config)  
5. [Orchestrators](#orchestrators)  
6. [ClickHouse + ChromaDB](#clickhouse--chromadb)  
7. [Embedding Storage (ClickHouse + Chroma)](#embedding-storage-clickhouse--chroma)  
8. [Adjusting (Training)](#adjusting-training)  
9. [Dataset Preparation](#dataset-preparation)  
10. [Inference & Benchmarking](#inference--benchmarking)  
11. [Utility Scripts](#utility-scripts)  
12. [Project Structure](#project-structure)  
13. [License & Acknowledgements](#license--acknowledgements)  

---

## Overview

This repository provides:

- Classification and embedding models pretrained on ImageNet  
- A ChromaDB vector database (backed by ClickHouse) for storing and querying embeddings  
- End-to-end pipelines for dataset preparation, embedding storage, and inference benchmarking  
- Utility scripts for data inspection and result analysis  

---

## Models

### Classification Models

- `deit_tiny_patch16_224`
- `deit_small_patch16_224`
- `deit_base_patch16_224`
- `resnet50`
- `resnet101`
- `shufflenet_v2_x1_0`

### Embedding Models

- `mobilenet_v2`
- `dinov1_s`
- `dinov1_b`
- `dinov1_b8`
- `dinov2_s`
- `dinov2_b`

---

## Quickstart

1. **Clone the repository**  
   ```bash
   git clone https://github.com/SOME_USER/prior_hallucinations
   cd prior_hallucinations
   ```

2. **Set up a Python environment**  
   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   ```

3. **Install dependencies**  
   ```bash
   pip install -r requirements.txt
   ```

4. **(Optional) Run a quick test** to verify the setup:  
   ```bash
   python3 utils/count_rows.py
   ```

---

## Config

Under `config/`:
- `config.py` – Contains folder names, `base_dir`, and other relative paths.  
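
As a rough illustration of how such a file can derive every path from one base directory, here is a minimal sketch. The function `build_paths`, the `objectnet` and `results` folder names, and the `RESULTS_DIR` key are hypothetical; only `base_dir`, `IMAGENET_DATA_DIR`, `DATASET_EXTERNAL`, and the `imageNetV2/data` layout are mentioned in this README, and the actual `config.py` may be organized differently.

```python
from pathlib import Path


def build_paths(base_dir):
    """Derive dataset and output folders from a single base directory."""
    base = Path(base_dir)
    return {
        "base_dir": base,
        # Follows the YOUR_BASE_DIR/imageNetV2/data/ layout from Dataset Preparation.
        "IMAGENET_DATA_DIR": base / "imageNetV2" / "data",
        # Hypothetical folder name for the resized ObjectNet images.
        "DATASET_EXTERNAL": base / "objectnet",
        # Hypothetical folder for CSV results.
        "RESULTS_DIR": base / "results",
    }


# Example base directory; substitute your own.
paths = build_paths("/data/project")
print(paths["IMAGENET_DATA_DIR"])
```

Keeping every location relative to one base directory means moving the project only requires changing a single value.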

---

## Orchestrators

The **orchestrators** serve as high-level entry points for the main processing stages of the project.  
They wrap and coordinate multiple scripts from different modules, allowing you to run the core workflows with a single command.

All orchestrators are located in the `orchestrator/` directory.

**Usage example:**
```bash
python3 orchestrator/process_embeddings.py
```
| Script                               | Purpose                                                                                                                                                                                                            |
|--------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `orchestrator/process_embeddings.py` | Generates and stores embeddings for the **original training dataset** using the selected classification–embedding model pairs. See the [ClickHouse + ChromaDB](#clickhouse--chromadb) section for storage details. |
| `orchestrator/train.py`              | Runs the training/adjustment pipeline to determine the optimal value of **N** that maximizes *Normalized Confidence Gain*. See the [Adjusting (Training)](#adjusting-training) section for details.                |
| `orchestrator/test_internal.py`      | Executes the inference and evaluation pipeline, calculating performance metrics on the internal test set. See the [Inference & Benchmarking](#inference--benchmarking) section for details.                      |
| `orchestrator/test_external.py`      | Executes the inference and evaluation pipeline, calculating performance metrics on the external test set. See the [Inference & Benchmarking](#inference--benchmarking) section for details.                      |
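
To make the idea of an orchestrator concrete, the sketch below chains several modules with `subprocess`. It is not the code in `orchestrator/train.py`: the stage order in `TRAIN_STAGES` is an assumption pieced together from the script names in this README, and `build_commands` and `run_stages` are hypothetical helpers.

```python
import subprocess
import sys

# Assumed stage order; the real orchestrator/train.py may differ.
TRAIN_STAGES = [
    "adjusting.search_embeddings",
    "adjusting.analysis_multiple",
    "adjusting.find_best_integral",
    "adjusting.store_best_N",
]


def build_commands(stages, python=sys.executable):
    """Turn module names into `python -m <module>` command lists."""
    return [[python, "-m", stage] for stage in stages]


def run_stages(stages, dry_run=False):
    """Run each stage in order and stop at the first failure."""
    for command in build_commands(stages):
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # dry_run=True only prints the commands without executing them.
    run_stages(TRAIN_STAGES, dry_run=True)
```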

---

## ClickHouse + ChromaDB

### Installing ChromaDB
```bash
pip install chromadb
```

### Running the ChromaDB Server
```bash
chroma run --path /absolute/path/to/db --port 8010
```

**Example: Connecting from Python**
```python
import chromadb
client = chromadb.HttpClient(host="localhost", port=8010)
print(client.list_collections())
```
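
Once connected, storing and querying embeddings follows the usual Chroma pattern of `get_or_create_collection`, `add`, and `query`. The helper below is a minimal sketch, not the project's storage code: `store_and_query`, the `label` metadata key, and any collection name you pass in are hypothetical. It takes the client as an argument so it works with the `HttpClient` shown above.

```python
def store_and_query(client, collection_name, ids, embeddings, labels, query, k=3):
    """Add labelled embeddings to a collection, then fetch the k nearest ones."""
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(
        ids=ids,
        embeddings=embeddings,
        # Hypothetical metadata layout: one class label per vector.
        metadatas=[{"label": label} for label in labels],
    )
    result = collection.query(query_embeddings=[query], n_results=k)
    # query() returns one list per query vector; a single query was sent.
    return list(zip(result["ids"][0], result["distances"][0]))
```

With the server running, a call such as `store_and_query(client, "demo", ["a", "b"], [[0.0, 0.0], [1.0, 1.0]], ["cat", "dog"], [0.9, 0.9], k=1)` would return the id and distance of the closest stored vector.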

---

## Embedding Storage (ClickHouse + Chroma)

Scripts under `clickhouse/`:

| Script                               | Purpose                                                                                                               |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `utils/check_collections.py`         | Check all collections and count the number of records in each.                                                        |
| `utils/count_repeated_elements.py`   | Count repeated embeddings in each collection.                                                                         |
| `utils/count_unique_embs.py`         | Count unique embeddings in each collection.                                                                           |
| `utils/test_choose.py`               | Test searching the collections.                                                                                       |
| `utils/test_hgs.py`                  | Check how hypergraph search (HGS) retrieves images one-by-one.                                                        |
| `process_embeddings.py`              | Process embeddings for each image (for each classification model) and store vectors in ChromaDB.                      |

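Run any of the `utils/` scripts from the project root (`process_embeddings.py` sits one level up, directly in `clickhouse/`):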
```bash
python3 clickhouse/utils/<script_name>.py
```
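
For orientation, counting unique and repeated embeddings can be sketched in plain Python as follows. This is an illustration rather than the logic of `count_repeated_elements.py` or `count_unique_embs.py`: the function name, the rounding tolerance, and the definition of "repeated" (extra copies beyond the first) are assumptions. In practice the vectors would come from a Chroma collection, for example via `collection.get(include=["embeddings"])`.

```python
from collections import Counter


def count_repeated(embeddings, decimals=6):
    """Return (unique_count, repeated_count) for a list of vectors.

    Vectors are rounded before comparison so tiny float noise is ignored.
    """
    keys = Counter(tuple(round(x, decimals) for x in vec) for vec in embeddings)
    unique = len(keys)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(n - 1 for n in keys.values())
    return unique, repeated


vectors = [[0.1, 0.2], [0.1, 0.2], [0.3, 0.4]]
print(count_repeated(vectors))  # (2, 1)
```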

---

## Adjusting (Training)

Scripts under `adjusting/`:

| Script                               | Purpose                                                                                                               |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `analysis.py`                        | Analyzes a pair (classification model + embedding model), storing values of N and L (number of neighbors within L).  |
| `analysis_multiple.py`               | Same as above but handles multiple model pairs.                                                                       |
| `benchmark_performance.py`           | Computes predicted labels for each training set image and reports benchmark accuracy.                                 |
| `find_best_integral.py`              | Processes analysis results, integrates the area under the Confidence Curve, and finds N that maximizes Normalized Confidence Gain. |
| `store_best_N.py`                    | Filters the main CSV to keep only rows for the best N.                                                                |
| `search_embeddings.py`               | Searches for the 1,000 nearest neighbors from the ImageNet-1K base set for each training image embedding.             |
| `plot_integrals.py`                  | Plots Normalized adjusted integrals for different model pairs across various N.                                       |
| `plot_analysis_single.py`            | Plots Normalized Confidence Gain and Confidence Curves for a single model pair.                                       |
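
The selection step in `find_best_integral.py` amounts to integrating a curve per candidate N and keeping the N with the largest value. The sketch below shows only that skeleton, using the trapezoidal rule. It deliberately omits the normalization behind *Normalized Confidence Gain*, whose formula is not given in this README, and the function names and toy numbers are made up for illustration (they are not results).

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve given sorted x values."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


def best_n(curves):
    """Pick the N whose curve has the largest area.

    `curves` maps N -> (xs, ys). No normalization is applied here.
    """
    areas = {n: trapezoid_area(xs, ys) for n, (xs, ys) in curves.items()}
    return max(areas, key=areas.get), areas


# Toy curves for two candidate values of N (illustrative numbers only).
toy_curves = {
    10: ([0.0, 0.5, 1.0], [1.0, 0.8, 0.6]),
    50: ([0.0, 0.5, 1.0], [1.0, 0.9, 0.7]),
}
n_star, areas = best_n(toy_curves)
print(n_star, areas)
```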

---

## Dataset Preparation

### IMAGENETV2  
1. Copy the **ImageNetV2** dataset and its corresponding subfolders to the folder `IMAGENET_DATA_DIR`, and rename it to `data`.  
2. More details on the dataset can be found [here](https://imagenetv2.org/).  
3. In the `data` folder located at `YOUR_BASE_DIR/imageNetV2/data/`, you should have all subfolders from the original ImageNet dataset.  
4. If necessary, rename folders to match the required structure.
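
A quick way to confirm the layout before running the pipelines is to compare the subfolders of `data` against the class folders you expect. This is a minimal sketch with a hypothetical helper name. The expected folder names are supplied by the caller, since this README does not list them.

```python
import tempfile
from pathlib import Path


def missing_class_folders(data_dir, expected_classes):
    """Return expected class subfolders that are absent from data_dir."""
    present = {p.name for p in Path(data_dir).iterdir() if p.is_dir()}
    return sorted(set(expected_classes) - present)


# Self-contained demo on a temporary directory with one class folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "0").mkdir()
    print(missing_class_folders(tmp, ["0", "1"]))  # ['1']
```

An empty list means every expected subfolder is present.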

---

### OBJECTNET  
> **Disclaimer:** ObjectNet was used only for final metrics. No training or parameter tuning was performed on it.  

1. Download the dataset from the original link.  
2. Resize the images using the script `datasets/resize_server.py`.  
3. Store the processed data in `DATASET_EXTERNAL` following the preprocessing instructions in the script.  

---

### Helper Scripts for Test Sets  
> **Note:** File paths are **hardcoded** in the `resize_server.py` script.

| Filepath                     | Description                                                              |
|------------------------------|--------------------------------------------------------------------------|
| `datasets/resize_server.py`  | Iterates through all images and stores resized versions (hardcoded paths) |
| `datasets/split_train_val.py`| Splits the ImageNet-V2 dataset into training and validation sets          |

Run:
```bash
python3 datasets/resize_server.py
python3 datasets/split_train_val.py
```
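
For readers who want to see what a per-class train/validation split can look like, here is a minimal sketch. It is not the code in `split_train_val.py`: the function name, the 20% validation fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_per_class(files_by_class, val_fraction=0.2, seed=0):
    """Split each class's file list into train and validation parts."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = {}, {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so input order does not matter
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * val_fraction)
        val[label] = shuffled[:n_val]
        train[label] = shuffled[n_val:]
    return train, val


demo = {"0": [f"img_{i}.jpeg" for i in range(10)]}
train, val = split_per_class(demo)
print(len(train["0"]), len(val["0"]))  # 8 2
```

Splitting inside each class keeps the class balance the same in both subsets.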

---

## Inference & Benchmarking

From the project root:
```bash
python3 -m inference.benchmark_test
```

| Script                                    | Arguments            | Purpose                                                                                                                      |
|-------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------|
| `benchmark_test.py`                       | –                    | Calculate predictions for each image in the internal test set, split into subsets, and save results to CSV.                 |
| `benchmark_test_internal.py`              | –                    | Same as `benchmark_test.py` but for the internal configuration/testing.                                                      |
| `benchmark_test_external.py`              | –                    | Same as above for the external test set.                                                                                     |
| `search_embeddings_test_internal.py`      | –                    | Search embeddings in the internal test set.                                                                                  |
| `search_embeddings_test.py`               | –                    | Search embeddings in the test set.                                                                                           |
| `calculate_baseline.py`                   | `INTERNAL`           | Calculate accuracy of models on each subset.                                                                                 |
| `calculate_final_integrals.py`            | `INTERNAL`           | Calculate Normalized Confidence Gain for each subset and model/embedding ensemble.                                           |
| `final_inference_combination.py`          | `INTERNAL`           | Calculate mean accuracy and coverage for the best N model–embedding pairs.                                                   |
| `inference_test.py`                       | `INTERNAL`           | Run inference tests with the internal configuration.                                                                         |
| `plot_combination_confidence_curves.py`   | `INTERNAL`           | Plot confidence curves for each model with embedding ensembles.                                                              |

The argument is either `INTERNAL` or `EXTERNAL`, depending on the dataset you want to evaluate:

- `INTERNAL`: ImageNetV2 (validation)
- `EXTERNAL`: ObjectNet
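
A script that accepts this argument could validate it with `argparse` along the following lines. This is a sketch under the assumption that the value is passed as a single positional argument. The `DATASETS` mapping and `parse_args` helper are hypothetical names, and the actual scripts may read the argument differently.

```python
import argparse

# Mapping taken from the description above.
DATASETS = {"INTERNAL": "ImageNetV2 (validation)", "EXTERNAL": "ObjectNet"}


def parse_args(argv=None):
    """Parse one positional argument restricted to the known dataset keys."""
    parser = argparse.ArgumentParser(description="Select the evaluation dataset.")
    parser.add_argument("dataset", choices=sorted(DATASETS), help="INTERNAL or EXTERNAL")
    return parser.parse_args(argv)


args = parse_args(["EXTERNAL"])
print(args.dataset, "->", DATASETS[args.dataset])  # EXTERNAL -> ObjectNet
```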

---

## Utility Scripts

Under `utils/`:

| Script               | Purpose                                                                                          |
|----------------------|--------------------------------------------------------------------------------------------------|
| `count_rows.py`      | Calculate the number of rows in a CSV file (filename hardcoded in the script).                   |
| `get_class_names.py` | Get ImageNet-1K class names from the dataset.                                                     |

Run:
```bash
python3 utils/count_rows.py
python3 utils/get_class_names.py
```
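
Because the filename in `count_rows.py` is hardcoded, a reusable variant may be handy. The sketch below is an assumption about what such a helper could look like, not the script's actual code. It counts data rows in any open CSV handle and optionally skips a header line.

```python
import csv
import io


def count_rows(handle, has_header=True):
    """Count data rows in an open CSV file, optionally skipping the header."""
    rows = sum(1 for _ in csv.reader(handle))
    return max(rows - 1, 0) if has_header else rows


# In-memory sample with a header and two data rows.
sample = io.StringIO("image,label\na.jpg,3\nb.jpg,7\n")
print(count_rows(sample))  # 2
```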

---



## Project Structure

```
.
├── clickhouse/
│   ├── __init__.py
│   ├── process_embeddings.py
│   └── utils/
│       ├── check_collections.py
│       ├── count_repeated_elements.py
│       ├── count_unique_embs.py
│       ├── test_choose.py
│       └── test_hgs.py
├── config/
│   ├── __init__.py
│   └── config.py
├── adjusting/
│   ├── __init__.py
│   ├── analysis.py
│   ├── analysis_multiple.py
│   ├── benchmark_performance.py
│   ├── calculate_final_integrals.py
│   ├── find_best_integral.py
│   ├── plot_analysis_single.py
│   ├── search_embeddings.py
│   ├── store_best_N.py
│   └── plot_integrals.py
├── inference/
│   ├── __init__.py
│   ├── benchmark_test.py
│   ├── benchmark_test_external.py
│   ├── calculate_baseline.py
│   ├── calculate_final_integrals.py
│   ├── final_inference_combination.py
│   ├── plot_combination_confidence_curves.py
│   └── search_embeddings_test.py
├── orchestrator/
│   ├── __init__.py
│   ├── train.py
│   ├── process_embeddings.py
│   └── test.py
├── utils/
│   ├── __init__.py
│   ├── count_rows.py
│   └── get_class_names.py
├── datasets/
│   ├── __init__.py
│   ├── resize_server.py
│   └── split_train_val.py
├── requirements.txt
└── README.md
```

---

## License & Acknowledgements

© 2025 SOME_USER.  
This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).  
You are free to share and adapt the material for any purpose, even commercially, provided that appropriate credit is given, a link to the license is included, and any changes made are indicated.
