# Conformal Data Contamination Tests for Trading or Sharing of Data
This repository contains the code library used to create the results of the paper `Conformal Data Contamination Tests for In-distribution Data Acquisition` submitted to ICLR'26.

## Contents
### Modules
- Benjamini_Hochberg.py `Base functionality to run the adaptive Benjamini-Hochberg procedure.`
- ConformalContaminationTestModule.py `Class for computing conformal data contamination test statistics and p-values.`
- ConformalScoreModule.py `Class for computing conformal scores.`
- DataHandlerModule.py `Class for loading, organizing, and sampling the data.`
- SupervisedMachineLearningModule.py `Class for fitting and evaluating classifiers.`
- utilities.py `Miscellaneous helper functions used in the main scripts.`
- autoencoder.py `Base implementation of an autoencoder.`
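
For orientation, the two statistical building blocks named above can be sketched generically. This is a minimal textbook illustration, not the code in `ConformalContaminationTestModule.py` or `Benjamini_Hochberg.py`: it computes standard per-point conformal p-values and applies the plain (non-adaptive) Benjamini-Hochberg step-up rule. The function names `conformal_p_values` and `benjamini_hochberg` are hypothetical, and the paper's contamination test statistics and the adaptive variant are not reproduced here.

```python
import numpy as np


def conformal_p_values(calibration_scores, test_scores):
    """Textbook conformal p-value per test score: (1 + #{cal >= test}) / (n + 1).

    Assumes larger scores indicate less conformity with the calibration data.
    """
    cal = np.asarray(calibration_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    # For each test score, count calibration scores at least as large.
    n_geq = (cal[None, :] >= test[:, None]).sum(axis=1)
    return (1.0 + n_geq) / (cal.size + 1.0)


def benjamini_hochberg(p_values, alpha=0.1):
    """Boolean mask of hypotheses rejected by the standard BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared against alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: reject everything up to the largest index meeting its threshold.
        k = np.nonzero(below)[0].max()
        rejected[order[: k + 1]] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected whenever a larger p-value meets its threshold.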

### Simulation scripts
- ProposedAccuracy.py `Run a simulation study with the proposed procedure and the baselines; see Section 4 and Section S4.`
- ProposedAccuracyCV.py `Run a simulation study with the proposed procedure and the baselines, selecting hyperparameters based on the data in the first round; see Section S4.4.`
- ScoringAnalysis.py `Run a simulation study with the proposed procedure and the baselines, evaluating only the conformal data contamination tests; see Section 4 and Section S4.`

### Recreating figures and tables
- Figure 2: ScoringAnalysisResults/ScoringResultsLoad.py
- Table 1: ScoringAnalysisResults/ScoringBH.py
- Figure S5: ScoringAnalysisResults/CODplot.py
- Tables S3-S5: ScoringAnalysisResults/AUCtables.py
- Table S6: ScoringAnalysisResults/TDRtables.py
- Tables S7-S8: ScoringAnalysisResults/FDRtables.py
- Figure 3, Figures S6-S7, and Table S9: ProposedAccuracyResults/ProposedAccuracyLoad.py
- Table S10: ProposedAccuracyCVResults/ProposedAccuracyCVLoad.py

## Software Setup

### Python dependencies
```
python 3.12.4
numpy 2.0
matplotlib 3.9.1
pandas 2.2.2
scipy 1.14
scikit-learn 1.5.1
tensorflow 2.11.0
```
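
To compare an existing environment against the versions listed above, a small check along the following lines may help. It is a sketch that is not part of the repository: the names `EXPECTED`, `matches`, and `check_environment` are hypothetical, and it assumes the distribution names on PyPI match those in the list. It only reports mismatches and installs nothing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this README (assumed to be PyPI distribution names).
EXPECTED = {
    "numpy": "2.0",
    "matplotlib": "3.9.1",
    "pandas": "2.2.2",
    "scipy": "1.14",
    "scikit-learn": "1.5.1",
    "tensorflow": "2.11.0",
}


def matches(installed, expected):
    """True if `installed` equals `expected` or refines it (e.g. 2.0.1 vs 2.0)."""
    return installed == expected or installed.startswith(expected + ".")


def check_environment(expected=EXPECTED):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    problems = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package is not installed at all
        if have is None or not matches(have, want):
            problems[name] = (have, want)
    return problems


if __name__ == "__main__":
    for name, (have, want) in check_environment().items():
        print(f"{name}: installed {have or 'missing'}, README lists {want}")
```

An empty result from `check_environment()` means every listed package is installed at a matching version.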

## Data
This folder includes the retinal fundus image data and the MNIST data. The data for the other examples are omitted due to space limitations, as are the scripts and result data needed to generate Figures 2 and S5 and Tables 1 and S3-S8.
