# What Does It Take to Build a Performant Selective Classifier?

<strong>📄 Paper URL: ...</strong>

## 🧠 Abstract

Selective classifiers improve reliability by abstaining on uncertain inputs, yet their performance often lags behind the perfect-ordering oracle that accepts examples in exact order of correctness. We formulate this shortfall as a coverage-uniform selective-classification gap and prove the first finite-sample decomposition that pinpoints five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation or shift-induced slack. Our bound shows that monotone post-hoc calibration cannot reduce the gap, as it preserves the original score ordering; closing the gap therefore requires scoring mechanisms that can modify the ranking induced by the base model. We validate our gap decomposition on synthetic two-moons data and real-world vision benchmarks, isolating each error component via controlled experiments. Results confirm that (i) Bayes noise and limited model capacity alone explain large gaps, (ii) only non-monotone or feature-aware calibrators shrink the ranking term, and (iii) distribution shift adds a distinct slack that must be addressed by robust training. Our decomposition supplies a quantitative error budget and concrete design guidelines for building selective classifiers that approach ideal oracle behavior.
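To make the gap and the monotone-calibration claim concrete, here is a minimal sketch on toy data. It is an illustration only and not code from this repository: the helper names (`ranking`, `selective_accuracy`, `oracle_accuracy`, `temperature_scale`) and the toy scores are hypothetical, and temperature scaling on a binary confidence stands in for any monotone post-hoc calibrator.

```python
import math


def ranking(scores):
    """Indices sorted from most to least confident (stable for ties)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def selective_accuracy(scores, correct, coverage):
    """Accuracy on the top `coverage` fraction of examples, ranked by score."""
    k = max(1, math.ceil(coverage * len(scores)))
    accepted = ranking(scores)[:k]
    return sum(correct[i] for i in accepted) / k


def oracle_accuracy(correct, coverage):
    """Perfect-ordering oracle: accepts every correct example before any error."""
    k = max(1, math.ceil(coverage * len(correct)))
    return min(sum(correct), k) / k


def temperature_scale(scores, temperature):
    """Monotone recalibration of confidences in (0, 1): sigmoid(logit(p) / T)."""
    scaled = []
    for p in scores:
        logit = math.log(p / (1.0 - p))
        scaled.append(1.0 / (1.0 + math.exp(-logit / temperature)))
    return scaled


# Toy data (hypothetical): confidence scores and 0/1 correctness indicators.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55]
correct = [1, 0, 1, 1, 0, 1]
calibrated = temperature_scale(scores, temperature=2.0)

for coverage in (0.5, 1.0):
    oracle = oracle_accuracy(correct, coverage)
    gap_raw = oracle - selective_accuracy(scores, correct, coverage)
    gap_cal = oracle - selective_accuracy(calibrated, correct, coverage)
    print(f"coverage={coverage:.1f}  gap(raw)={gap_raw:.3f}  gap(calibrated)={gap_cal:.3f}")
```

Because temperature scaling is strictly increasing, it leaves the ranking untouched, so the two gap columns match at every coverage level. Shrinking the gap requires a scorer that reorders examples.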

## ⚙️ Installation with `uv`

We use [`uv`](https://github.com/astral-sh/uv) as our package manager. It is a fast Python dependency management tool and a drop-in replacement for `pip`.

### Step 1: Install `uv` (if not already installed)

```bash
pip install uv
```

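### Step 2: Install dependencies

`uv pip install` expects a virtual environment. If the project directory has no `.venv` yet, create one first with `uv venv`.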

```bash
uv pip install -e .
```

### Step 3: Activate the environment

```bash
source .venv/bin/activate
```

## 🗂️ Codebase overview
**Training:**
- `train_main.py`: Trains a standard model from scratch across datasets and selective prediction methods.
- `train_cifar_n.py`: Trains a model on the CIFAR-10N/100N datasets.
- `train_lp.py`: Trains a loss predictor on top of penultimate layer representations.

**Evaluation:**
- `eval_arch.py`: Evaluation across model architectures.
- `eval_cifar_c.py`: Evaluation on the CIFAR-10C/100C datasets.
- `eval_cifar_n.py`: Evaluation on the CIFAR-10N/100N datasets.
- `eval_shift.py`: Evaluation on real-world distribution shifts.

**General:**
- `synth_exp_plot.ipynb`: Notebook used for synthetic experiments and general plotting.
