# Scalable Utility-Aware Multiclass Calibration

This repository contains the code for the NeurIPS 2025 submission titled "Scalable Utility-Aware Multiclass Calibration".

## Overview

Ensuring that classifiers are well-calibrated, i.e., that their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects of the prediction (e.g., top-class confidence, class-wise calibration) or rely on computationally challenging variational formulations. We instead propose *utility calibration*, a general framework that evaluates model calibration directly through the lens of downstream applications. This approach measures the calibration error relative to a specific *utility function* that encapsulates the goals or decision criteria relevant to the end user. As such, utility calibration provides a task-specific perspective on reliability. We demonstrate how this framework can *unify and re-interpret several existing calibration metrics*, in particular yielding more robust versions of the top-class and class-wise calibration metrics, and how it goes beyond such binarized approaches to assess calibration for richer classes of downstream utilities.

## Acknowledgements

Implementations for several standard calibration methods in the `calibration_methods/` directory are adapted from the [calibration-baselines](https://github.com/tiago-salvador/calibration-baselines) repository by Tiago Salvador.

## Prerequisites

* Python 3.x
* PyTorch
* NumPy
* SciPy
* JAX & JAXlib
* Matplotlib
* scikit-learn
* scikit-optimize
* tqdm
* pandas

You can install the necessary packages using pip:
```bash
pip install torch numpy scipy jax jaxlib matplotlib scikit-learn scikit-optimize tqdm pandas
```

## Directory Structure

The scripts expect the following directory layout for code, data, and results:

```
code_submit_folder/
├── calibration_methods/           # Implementations of various calibration algorithms
│   ├── ccac.py
│   ├── dirichlet_calibration.py
│   ├── ensemble_temperature_scaling.py
│   ├── irova.py
│   ├── irova_ts.py
│   ├── irm.py
│   ├── irm_ts.py
│   ├── linear_calibrator.py
│   ├── matrix_scaling.py
│   ├── temperature_scaling.py
│   └── vector_scaling.py
├── logits/                          # Root directory for model logits and labels
│   ├── ModelName_DatasetName/       # E.g., ViT_Base_P16_224_ImageNet1k
│   │   ├── {dataset_name}_logits.pt # Pre-computed model logits
│   │   ├── {dataset_name}_labels.pt # Corresponding true labels
│   │   └── results/                 # Directory for storing experiment results
│   │       ├── MethodName/          # Results for each calibration method
│   │       │   ├── test_probs_{split_num}.npy
│   │       │   ├── test_labels_{split_num}.npy
│   │       │   ├── metrics_{split_num}.json
│   │       │   └── ...
│   │       ├── config.json          # Experiment configuration for real_data_calibration.py
│   │       └── final_results.csv    # Aggregated results from real_data_calibration.py
│   └── ...                          # Other model_dataset directories
├── experiment_cdf_data/             # Root directory for ECDF data logs
│   ├── ModelName_DatasetName/
│   │   └── cdf_plot_logs/           # Stores .npy error distributions and _metadata.json
│   │       ├── MethodName_linear_errors.npy
│   │       ├── MethodName_rank_errors.npy
│   │       └── _metadata.json
│   └── ...
├── ecdf_plots_from_logs/            # Output directory for plots from plot_ecdf_from_logs.py
│   ├── ModelName_DatasetName/
│   │   └── {dataset_name}_{model_name}_combined_uc_ecdf.pdf
│   └── ...
├── ecdf_plots_comparison/           # Output directory for plots from compare_models_ecdf.py
│   └── model_comparison_{MODEL_LEFT}_vs_{MODEL_RIGHT}.pdf
├── compute_ecdf_data.py             # Script to compute ECDF data from existing experiment results
├── compare_models_ecdf.py           # Script to generate side-by-side ECDF comparison plots for two models
├── plot_ecdf_from_logs.py           # Script to plot ECDF curves from saved .npy log files
├── post_hoc.py                      # Script for running post-hoc utility calibration experiments
├── process_results.py               # Script to format CSV results from recompute_metrics.py
├── real_data_calibration.py         # Script to run various standard calibration methods
├── recompute_metrics.py             # Script to recompute metrics from saved .npy probability files
├── uncertainty_measures.py          # Utility script for various uncertainty and calibration metrics
└── utility_cal.py                   # Core utility calibration functions
```

## Data Preparation

### Logits and Labels
1. Create a root directory named `logits`.
2. Inside `logits`, create a subdirectory for each model-dataset combination you want to evaluate (e.g., `ResNet20_CIFAR100`, `ViT_Base_P16_224_ImageNet1k`).
3. In each `ModelName_DatasetName` directory, place the pre-computed logits and labels as PyTorch tensor files (`.pt`).
   - Logits file should be named `{dataset_name}_logits.pt` (e.g., `cifar100_logits.pt`, `imagenet_logits.pt`). This file should contain a 2D tensor of shape `(n_samples, n_classes)`.
   - Labels file should be named `{dataset_name}_labels.pt` (e.g., `cifar100_labels.pt`, `imagenet_labels.pt`). This file should contain a 1D tensor of shape `(n_samples,)` with integer class labels.
4. The script `real_data_calibration.py` (and others that load data) infers the `dataset_name` (e.g., "cifar100", "cifar10", "imagenet") from the `ModelName_DatasetName` directory name.
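
Before running any script, it can help to confirm that a model-dataset directory holds the two expected files. The following is a minimal sketch using only the standard library; the helper `missing_inputs` is hypothetical and not part of this repository, and it takes `dataset_name` explicitly rather than reproducing how the scripts infer it.

```python
from pathlib import Path


def missing_inputs(model_dir, dataset_name):
    """Return the names of expected input files that are absent from model_dir.

    Hypothetical helper: it only checks file names, not tensor shapes.
    """
    model_dir = Path(model_dir)
    expected = [
        model_dir / f"{dataset_name}_logits.pt",  # 2D tensor (n_samples, n_classes)
        model_dir / f"{dataset_name}_labels.pt",  # 1D tensor (n_samples,)
    ]
    return [path.name for path in expected if not path.is_file()]
```

For example, `missing_inputs("./logits/ResNet20_CIFAR100", "cifar100")` returns an empty list when both files are in place.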

## Running Experiments and Reproducing Results

The following scripts are the main components for running experiments and generating the figures and tables presented in the paper.

### 1. Standard Calibration Methods Evaluation

The `real_data_calibration.py` script evaluates various standard post-hoc calibration methods.

**Purpose:** Loads model logits and labels, splits them into calibration and test sets over multiple runs, applies different calibration methods, evaluates them, and saves probabilities, labels, and metrics.

**Input:** Requires model logits and labels to be placed in the `logits/ModelName_DatasetName/` directory as described in "Data Preparation".

**Output:**
- Creates a `results/` subdirectory within each `logits/ModelName_DatasetName/` directory.
- Inside `results/`, it creates subdirectories for each calibration method (e.g., `Uncalibrated`, `TemperatureScaling`, `VectorScaling`, etc.).
- For each method and each split, it saves:
  - `val_probs_{split_num}.npy`, `test_probs_{split_num}.npy`: Calibrated probabilities for validation and test sets.
  - `val_labels_{split_num}.npy`, `test_labels_{split_num}.npy`: Corresponding labels.
  - `metrics_{split_num}.json`: Evaluation metrics for the split.
- It also saves `split_{split_num}_val_indices.npy` and `split_{split_num}_test_indices.npy` in the `results/` directory for reproducibility of data splits.
- It saves a `config.json` file in the `results/` directory detailing the experiment parameters.
- It saves an aggregated `final_results.csv` file in the `results/` directory, summarizing metrics (mean ± std) across splits for all methods.
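
The per-split metrics can also be aggregated by hand. The sketch below assumes that each `metrics_{split_num}.json` is a flat mapping from metric name to a number, and it uses the sample standard deviation; both are assumptions, since the exact JSON layout and the convention used for `final_results.csv` are not specified here. The function `aggregate_metrics` is hypothetical.

```python
import json
import statistics
from pathlib import Path


def aggregate_metrics(method_dir):
    """Collect metrics_*.json files and return {metric: (mean, std)}."""
    per_metric = {}
    for path in sorted(Path(method_dir).glob("metrics_*.json")):
        with open(path) as f:
            metrics = json.load(f)  # assumed: flat {name: number} mapping
        for name, value in metrics.items():
            per_metric.setdefault(name, []).append(float(value))

    summary = {}
    for name, values in per_metric.items():
        mean = statistics.mean(values)
        # Sample standard deviation needs at least two splits
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean, std)
    return summary
```

Each entry can then be rendered as `f"{mean:.4f} ± {std:.4f}"`.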

**Usage:**
```bash
python real_data_calibration.py --model-path ./logits/YourModel_YourDataset --n-splits 5 --val-ratio 0.7
```
- `--model-path`: Path to the specific model-dataset directory (e.g., `./logits/ResNet20_CIFAR100`).
- `--n-splits`: Number of random calibration/test splits to evaluate (default: 5).
- `--val-ratio`: Ratio of data for the calibration set (default: 0.7).
- `--output`: Path for the output summary CSV (default: `calibration_results.csv` inside results_dir).
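
To illustrate what `--n-splits` and `--val-ratio` control, here is a minimal NumPy sketch of repeated random calibration/test splitting. It is an illustration under assumptions, not the script's own code: the function `random_split` and the per-split seeding scheme are hypothetical.

```python
import numpy as np


def random_split(n_samples, val_ratio=0.7, seed=0):
    """Shuffle sample indices and split them into calibration and test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = int(round(val_ratio * n_samples))
    return perm[:n_val], perm[n_val:]


# One split per seed, mirroring --n-splits 5 --val-ratio 0.7
splits = [random_split(10_000, val_ratio=0.7, seed=s) for s in range(5)]
val_idx, test_idx = splits[0]
print(len(val_idx), len(test_idx))  # 7000 3000
```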

### 2. Post-Hoc Utility Calibration (Iterative Patching)

The `post_hoc.py` script implements and evaluates the iterative utility calibration method described in the paper (referred to as "Patch" in the paper's results).

**Purpose:** Loads model logits and labels, splits data, and applies the iterative utility calibration algorithm. It saves calibrated probabilities and metrics.

**Input:** Similar to `real_data_calibration.py`, requires logits and labels in `logits/ModelName_DatasetName/`.

**Output:**
- Creates method-specific subdirectories under `logits/ModelName_DatasetName/results/` (e.g., `PostHocUC_Union_iters125_step_fixed_0.01_sub500`). The name depends on the chosen parameters.
- Saves `test_probs_{split_num}.npy`, `test_labels_{split_num}.npy`, and `metrics_{split_num}.json` for each split.
- Saves an aggregated summary CSV (e.g., `summary_posthoc_uc_ModelName_DatasetName_TIMESTAMP.csv`) in the `results/` directory.

**Usage:**
```bash
python post_hoc.py --model-path ./logits/YourModel_YourDataset \
                   --n-splits 5 \
                   --val-ratio 0.7 \
                   --max-iters-cal 125 \
                   --n-samples-update-iter 500 \
                   --utility-types union \
                   --stepsize-type fixed \
                   --fixed-stepsize-value 0.01 \
                   --verbose
```
- `--model-path`: Path to the model-dataset directory.
- `--n-splits`: Number of random splits (default: 5).
- `--val-ratio`: Ratio for calibration set (default: 0.7).
- `--n-samples-cal-overall`: Number of samples for the initial calibration split (default: all samples selected by `--val-ratio`).
- `--n-samples-update-iter`: Number of samples from calibration set for each iteration's update (default: 500).
- `--max-iters-cal`: Max iterations for the calibrator (default: 125).
- `--print-every-iters`: Print metrics every N iterations during fitting (default: 10).
- `--stepsize-type`: Type of stepsize calculation (`dynamic_alpha`, `dynamic_C`, `fixed`) (default: `fixed`).
- `--fixed-stepsize-value`: Value for fixed stepsize (default: 0.01).
- `--utility-types`: List of utility classes for calibrator (`cw`, `tk`, `union`) (default: `['union']`).
- `--verbose`: Enable verbose output.

### 3. Recomputing Metrics from Saved Probabilities

The `recompute_metrics.py` script can be used to recalculate evaluation metrics from the `.npy` files of probabilities and labels saved by other scripts (e.g., `real_data_calibration.py` or `post_hoc.py`). This is useful if you want to apply different evaluation criteria or debug metrics without re-running the entire calibration process.

**Purpose:** Loads saved test probabilities and labels for each method and split, recomputes metrics (including utility calibration measures), and saves/prints an aggregated summary.

**Input:** Expects the output structure created by `real_data_calibration.py` or `post_hoc.py` (i.e., `logits/ModelName_DatasetName/results/MethodName/test_probs_*.npy` and `test_labels_*.npy`).

**Output:**
- Prints an aggregated summary table to the console.
- Saves a CSV file (e.g., `recomputed_metrics_TIMESTAMP.csv`) in the `logits/ModelName_DatasetName/` directory.

**Usage:**
```bash
python recompute_metrics.py --model-dir ./logits/YourModel_YourDataset \
                            --output ./logits/YourModel_YourDataset/recomputed_summary.csv \
                            --uc-max-samples 1000 \
                            --uc-num-subsamples 5
```
- `--model-dir`: Path to the model-dataset directory containing the `results/` subdirectory.
- `--output`: Path for the output CSV file (default: `model_dir/recomputed_metrics_{timestamp}.csv`).
- `--method-name`: Specific method directory name to process (optional).
- `--uc-disable-subsampling`: Disable subsampling for utility calibration metrics.
- `--uc-max-samples`: Max samples per subsample for UC metrics (default: 1000).
- `--uc-num-subsamples`: Number of subsamples for UC metrics (default: 5).
- `--uc-seed`: Random seed for UC subsampling (default: 42).
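
The saved files can also be inspected directly. The following sketch loads one split using the file names described above and computes top-1 accuracy and mean negative log-likelihood with NumPy. The helpers `load_split` and `accuracy_and_nll` are hypothetical and do not reproduce the utility calibration measures computed by the script.

```python
from pathlib import Path

import numpy as np


def load_split(method_dir, split_num):
    """Load the saved test probabilities and labels for one split."""
    method_dir = Path(method_dir)
    probs = np.load(method_dir / f"test_probs_{split_num}.npy")
    labels = np.load(method_dir / f"test_labels_{split_num}.npy")
    return probs, labels


def accuracy_and_nll(probs, labels, eps=1e-12):
    """Return top-1 accuracy and mean negative log-likelihood."""
    labels = labels.astype(int)
    acc = float(np.mean(probs.argmax(axis=1) == labels))
    true_class_probs = probs[np.arange(len(labels)), labels]
    # Clip to avoid log(0) for overconfident wrong predictions
    nll = float(-np.mean(np.log(np.clip(true_class_probs, eps, 1.0))))
    return acc, nll
```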

### 4. Generating ECDF Data

The `compute_ecdf_data.py` script processes the results of calibration experiments (specifically the `.npy` files containing test probabilities and labels, or the original logits for the 'Uncalibrated' case) to compute and save the utility calibration error distributions necessary for plotting eCDFs.

**Purpose:** For a given model and its calibration methods, this script aggregates data across splits (if applicable) and computes the linear and rank-based utility calibration error distributions. These distributions are saved as `.npy` files.

**Input:**
- Model logits and labels from `logits/ModelName_DatasetName/` for the 'Uncalibrated' method.
- `test_probs_{split_num}.npy` and `test_labels_{split_num}.npy` files from `logits/ModelName_DatasetName/results/MethodName/` for calibrated methods.
- `split_*_test_indices.npy` files from `logits/ModelName_DatasetName/results/` to determine which data splits to use.

**Output:**
- Creates an output directory structure: `output_log_parent_dir/ModelName_DatasetName/cdf_plot_logs/`.
- Inside `cdf_plot_logs/`, it saves:
  - `{MethodName}_linear_errors.npy`: Array of linear utility calibration errors.
  - `{MethodName}_rank_errors.npy`: Array of rank-based utility calibration errors.
  - `_metadata.json`: A JSON file containing metadata about the data generation process (e.g., number of splits, number of utility samples).

**Usage:**
```bash
python compute_ecdf_data.py --model-dir ./logits/YourModel_YourDataset \
                            --output-log-parent-dir ./experiment_cdf_data \
                            --target-split-num 1 \
                            --num-utility-samples 500 \
                            --uc_utility_sample_chunk_size 50
```
- `--model-dir`: Path to the main model directory (e.g., `./logits/ResNet56_CIFAR100`).
- `--output-log-parent-dir`: Parent directory to save ECDF data logs (default: `./experiment_cdf_data`).
- `--max-splits`: Max number of data splits to evaluate (default: 1, 0 for all). Ignored if `--target-split-num` is set.
- `--target-split-num`: Specify a single split number to process (e.g., 1 for `split_1_...`). If set, `--max-splits` is ignored.
- `--num-utility-samples`: Number of utility vectors to sample for ECDF distributions (default: 500).
- `--jax-seed`: Seed for JAX PRNGKey (default: 42).
- `--uc_data_subsample_trigger_n`: Data size threshold to trigger subsampling in ECDF (default: 5500).
- `--uc_data_subsample_batch_n`: Target batch size for data subsampling in ECDF (default: 5000).
- `--uc_data_num_subsample_batches`: Number of subsample batches if triggered (default: 5).
- `--uc_data_subsampling_seed`: Seed for data subsampling (default: 42).
- `--uc_utility_sample_chunk_size`: Chunk size for utility samples to avoid JAX memory issues (default: 50).
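
One plausible reading of the three data-subsampling flags is sketched below: when the dataset exceeds the trigger size, several random batches are drawn without replacement; otherwise all data is used once. This is an assumption about the intent of the flags, and the script's exact rule (for instance, the threshold comparison) may differ. The function `subsample_batches` is hypothetical.

```python
import numpy as np


def subsample_batches(n, trigger_n=5500, batch_n=5000, num_batches=5, seed=42):
    """Return a list of index arrays on which to evaluate."""
    if n <= trigger_n:
        return [np.arange(n)]  # small enough: use all data once
    rng = np.random.default_rng(seed)
    # Each batch is drawn independently, without replacement within the batch
    return [rng.choice(n, size=batch_n, replace=False) for _ in range(num_batches)]
```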

### 5. Plotting ECDF Curves from Logs

The `plot_ecdf_from_logs.py` script generates eCDF plots for a single model using the `.npy` error distribution files created by `compute_ecdf_data.py`.

**Purpose:** Loads the saved `MethodName_linear_errors.npy` and `MethodName_rank_errors.npy` files and generates a combined plot (linear and rank-based eCDFs side-by-side) for a specified model.

**Input:** `.npy` files from the `experiment_cdf_data/ModelName_DatasetName/cdf_plot_logs/` directory.

**Output:**
- Saves a PDF plot (e.g., `{dataset_name}_{model_name}_combined_uc_ecdf.pdf`) in a subdirectory named after the model within the `--output-plot-dir` (default: `./ecdf_plots_from_logs/ModelName_DatasetName/`).

**Usage:**
```bash
python plot_ecdf_from_logs.py --logs-root-dir ./experiment_cdf_data \
                              --model-name YourModel_YourDataset \
                              --output-plot-dir ./ecdf_plots_from_logs
```
- `--logs-root-dir`: Path to the root directory containing model-specific ECDF data (e.g., `./experiment_cdf_data`).
- `--model-name`: Specific model name to generate plots for. If empty, processes all models in `--logs-root-dir`.
- `--output-plot-dir`: Base directory to save plots (default: `./ecdf_plots_from_logs`).

### 6. Comparing Models with ECDF Plots

The `compare_models_ecdf.py` script generates a side-by-side comparison plot (1x4 subplots: Model1-Rank, Model1-Linear, Model2-Rank, Model2-Linear) for two specified models using their ECDF data.

**Purpose:** Facilitates direct visual comparison of calibration performance (linear and rank-based utility error eCDFs) between two models.

**Input:** `.npy` files from `experiment_cdf_data/ModelName_DatasetName/cdf_plot_logs/` for the two models specified in the script's global variables (`MODEL_LEFT`, `MODEL_RIGHT`).

**Output:**
- Saves a PDF plot (e.g., `model_comparison_MODEL_LEFT_vs_MODEL_RIGHT.pdf`) in the directory specified by `--output-plot-dir` (default: `./ecdf_plots_comparison/`).

**Configuration:** The two models to compare (`MODEL_LEFT`, `MODEL_RIGHT`) and the methods to include in the plot (`METHODS_TO_PLOT`, `LEGEND_MAPPING`) are defined as global variables within the `compare_models_ecdf.py` script and may need to be edited directly.

**Usage:**
```bash
python compare_models_ecdf.py --logs-root-dir ./experiment_cdf_data \
                              --output-plot-dir ./ecdf_plots_comparison
```
- `--logs-root-dir`: Path to the root directory for ECDF data.
- `--output-plot-dir`: Directory to save the comparison plot.

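### 7. Formatting Results

The `process_results.py` script formats the CSV results produced by `recompute_metrics.py`.

**Usage:**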
```bash
python process_results.py path/to/your_input_results.csv -o path/to/your_formatted_output.csv
```
- `input_file`: Path to the input CSV.
- `--output` or `-o`: Path for the output formatted CSV (optional).

## Core Modules

- **`utility_cal.py`**: Contains the core JAX-based functions for calculating utility calibration errors (linear, rank-based, top-class, class-wise, top-k) and their distributions. It also includes helper functions for data subsampling and combining results.
- **`uncertainty_measures.py`**: Provides functions to compute various standard uncertainty and calibration metrics like ECE (Expected Calibration Error) with equal-width and equal-mass binning, Brier score, NLL, and accuracy.
- **`calibration_methods/`**: This directory houses implementations for various post-hoc calibration algorithms used in the experiments, such as Temperature Scaling, Vector Scaling, Matrix Scaling, Dirichlet Calibration, Ensemble Temperature Scaling, IROvA, and IRM.
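
As a point of reference for the standard metrics mentioned above, here is a minimal NumPy sketch of top-label ECE with equal-width binning. It follows the textbook definition and is not the code in `uncertainty_measures.py`; the function name `top_label_ece` and the default of 15 bins are assumptions.

```python
import numpy as np


def top_label_ece(probs, labels, n_bins=15):
    """Top-label Expected Calibration Error with equal-width confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels).astype(int)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; clip so conf == 0 lands in bin 0
    bin_ids = np.clip(np.digitize(conf, edges, right=True) - 1, 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

With two predictions at confidence 0.9, of which one is correct, the accuracy in that bin is 0.5, so the function returns 0.4.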