# Deep Thinking on Out-of-Distribution Data: How can we know when a model is overthinking?

This repository contains the official implementation of our ICLR paper **"Deep Thinking on Out-of-Distribution Data: How can we know when a model is overthinking?"**

> 📋 This work builds upon the Deep Thinking framework ([original repo](https://github.com/aks2203/deep-thinking)) and proposes a method for learning to halt overthinking dynamically using a self-supervised auxiliary task.
>
> 🎯 Checkpoints and scripts are provided for training and testing models on CIFAR10-C, CIFAR100-C, and Tiny ImageNet-C.

---

## Requirements

We recommend Python 3.8.2. To install required packages:

```bash
pip install -r requirements.txt
```

Create the `data/` folder that will hold the corrupted datasets:

```bash
mkdir data
cd data
```

Manually download [CIFAR10-C](https://zenodo.org/record/2535967), [CIFAR100-C](https://zenodo.org/record/3555552), and [Tiny ImageNet-C](https://www.kaggle.com/datasets/luckyhathaway/tiny-imagenet-c), then place the extracted folders inside `data/`.
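
Once extracted, the CIFAR10-C and CIFAR100-C archives store one `.npy` array per corruption type, with the five severity levels stacked along the first axis (10,000 test images per level) and a shared `labels.npy`. The sketch below shows one way to pull out a single severity level, such as the level 5 split reported in the results. The helper name `select_severity` and the example paths in the comments are illustrative assumptions, not part of this repository's code.

```python
import numpy as np

# CIFAR10-C / CIFAR100-C: 10,000 test images per severity level.
IMAGES_PER_SEVERITY = 10_000


def select_severity(images, labels, severity, images_per_severity=IMAGES_PER_SEVERITY):
    """Return the images and labels for one severity level (1 to 5).

    Hypothetical helper: assumes severity levels are stacked in
    ascending order along the first axis of both arrays.
    """
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    start = (severity - 1) * images_per_severity
    stop = severity * images_per_severity
    return images[start:stop], labels[start:stop]


# Example usage (paths are assumptions about the extracted folder name):
# images = np.load("data/CIFAR-10-C/gaussian_noise.npy")
# labels = np.load("data/CIFAR-10-C/labels.npy")
# level5_images, level5_labels = select_severity(images, labels, severity=5)
```

Adjust the folder name to match whatever the archive extracts to on your machine.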

---

## Training

To train a model on CIFAR10-C, use:

```bash
bash launch/cifar_c_experiments.sh
```

After training, output files will be structured as:

```
outputs/
└── training_<exp_name>/
    └── training-<timestamp>/
        ├── model_best.pth
        ├── model_ssh_best.pth
        ├── stats.json
        ├── train.log
        └── .hydra/
            ├── config.yaml
            ├── hydra.yaml
            └── overrides.yaml
```
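
Because each run gets its own timestamped folder, it can be handy to locate the newest checkpoint programmatically before evaluation. The following is a minimal sketch based only on the layout shown above; `latest_checkpoint` is a hypothetical helper, and it picks the newest run by modification time because the exact timestamp format of the folder name is not assumed.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(exp_dir, filename: str = "model_best.pth") -> Optional[Path]:
    """Return `filename` from the most recently modified training-* run.

    Hypothetical helper: `exp_dir` is an outputs/training_<exp_name>/ folder.
    Returns None when no run contains the requested file.
    """
    runs = [p for p in Path(exp_dir).glob("training-*") if p.is_dir()]
    # Newest run first, based on the folder's modification time.
    for run in sorted(runs, key=lambda p: p.stat().st_mtime, reverse=True):
        candidate = run / filename
        if candidate.is_file():
            return candidate
    return None


# Example usage (replace <exp_name> with your experiment name):
# print(latest_checkpoint("outputs/training_<exp_name>"))
```

The returned path is a candidate value for `model_path` in the evaluation script.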

---

## Evaluation

To evaluate a saved model, set the `model_path` variable in `launch/test_cifar_c.sh` to the checkpoint you want to test, then run:

```bash
bash launch/test_cifar_c.sh
```

Evaluation output is saved to:

```
outputs/
└── <exp_name>/
    └── testing-<timestamp>/
        ├── stats.json
        ├── test.log
        ├── visualize_acc.png
        └── .hydra/
```
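
To inspect the saved statistics without opening the file by hand, a short script along these lines may help. It is a sketch that makes no assumption about which fields `stats.json` contains: it loads the file and, if the top level is a JSON object, lists each key with the type of its value. The function name `describe_stats` is hypothetical.

```python
import json
from pathlib import Path


def describe_stats(run_dir):
    """Load stats.json from a run folder, print its top-level keys, return the data.

    Hypothetical helper: the schema of stats.json is not assumed.
    """
    with open(Path(run_dir) / "stats.json", encoding="utf-8") as f:
        stats = json.load(f)
    if isinstance(stats, dict):
        for key, value in stats.items():
            print(f"{key}: {type(value).__name__}")
    return stats


# Example usage (fill in your own experiment name and timestamp):
# describe_stats("outputs/<exp_name>/testing-<timestamp>")
```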

---

## Pre-trained Models

We provide pre-trained models for CIFAR10-C, CIFAR100-C, and Tiny ImageNet-C.
They are available in the `checkpoints` directory.

---

## Results

Our Conv-GRU model outperforms strong baselines such as ViT and ResNet-30 on multiple OOD corruption benchmarks with fewer parameters.

### CIFAR10-C, Level 5 (Peak Accuracy %)

| Model      | Mean Accuracy |
| ---------- | ------------- |
| ViT        | 51.7          |
| ResNet-30  | 61.9          |
| Conv-LiGRU | 63.5          |
| Conv-GRU   | **67.0**      |

> For detailed results across all 15 corruption types, see the CIFAR10-C accuracy table in the appendix of the paper.

### CIFAR100-C, Level 5 (Peak Accuracy %)

| Model      | Mean Accuracy |
| ---------- | ------------- |
| ViT        | 26.8          |
| ResNet-30  | 22.6          |
| Conv-LiGRU | 28.3          |
| Conv-GRU   | **32.4**      |

> For detailed results across all 15 corruption types, see the CIFAR100-C accuracy table in the appendix of the paper.

---

## Citation

If you find our work useful, please cite:

```bibtex
@inproceedings{deepthinking2025,
  title={Deep Thinking on Out-of-Distribution Data: How can we know when a model is overthinking?},
  author={},
  booktitle={},
  year={}
}
```

---

## Acknowledgements

This repository is adapted from [Deep Thinking (original)](https://github.com/aks2203/deep-thinking) by Schwarzschild et al., released under the MIT license.

