# Evading Data Contamination Detection for Language Models is (too) Easy
This repo contains the code for our paper, *Evading Data Contamination Detection for Language Models is (too) Easy*. We explain how to install and run the code in this repository to reproduce the results presented in the paper. The code assumes that CUDA is available and that the GPU(s) provide at least 80GB of memory.

## Installation
You can install the code in this repo by installing [Conda](https://docs.conda.io/projects/miniconda/en/latest/) and running the following commands:

```bash
conda create -n contamination python=3.10
conda activate contamination
python -m pip install -e .
```

If you want the exact package versions we used, run the following commands instead:
```bash
conda create -n contamination python=3.10
conda activate contamination
python -m pip install -r requirements.txt
python -m pip install -e .
```

## Reproducing Results

Before starting, you should either set your OpenAI API key as an environment variable named `OPENAI_API_KEY` or create the file `scripts/.env` with the following content:
```bash
OPENAI_API_KEY=[YOUR API KEY]
```
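To confirm that one of the two options is in place before launching long-running jobs, a small check along these lines may help. This is only a sketch: the function name `has_openai_key` is hypothetical and not part of this repository, and it only tests for the presence of a key, not its validity.

```bash
# Hypothetical helper (not part of this repo): succeeds if the key is
# available through either of the two routes described above.
has_openai_key() {
    # Route 1: the variable is set and non-empty in the current environment.
    if [ -n "${OPENAI_API_KEY:-}" ]; then
        return 0
    fi
    # Route 2: the .env file defines a non-empty value.
    # The path defaults to scripts/.env and can be overridden via $1.
    local env_file="${1:-scripts/.env}"
    [ -f "$env_file" ] && grep -q '^OPENAI_API_KEY=..*' "$env_file"
}

if has_openai_key; then
    echo "OpenAI API key found"
else
    echo "OpenAI API key missing" >&2
fi
```

Run it from the repository root so that the default path `scripts/.env` resolves correctly.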

Furthermore, you should replace the Hugging Face username used in our files with your own Hugging Face username. Specifically, you should change it in the following files:
- [code-contamination-detection/bash.sh](code-contamination-detection/bash.sh) on line 6.
- [scripts/finetune.py](scripts/finetune.py) on line 22.

Finally, you should log in to Hugging Face by running
```bash
huggingface-cli login
```
and following the instructions.


You can reproduce all our results by running the following commands:

```bash
bash scripts/rewrite.sh
bash scripts/clean_eval.sh
bash scripts/check_overlap.sh
bash scripts/finetune.sh
bash code-contamination-detection/bash.sh
```

We note that the last of these commands, `bash code-contamination-detection/bash.sh`, implements the benchmark-level detection approach by Shi; the subfolder was copied from their [GitHub](https://github.com/swj0419/detect-pretrain-code-contamination) repo. Changes that we made are documented with a comment starting with `# NOTE`. All other methods are implemented in the [`src/contamination`](/src/contamination/) folder.

After this, the notebook [`notebooks/postprocess.ipynb`](notebooks/postprocess.ipynb) can be run to obtain extracted results for each model.

Note that fully reproducing the results on a single NVIDIA H100 GPU can take several weeks.
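
Given that runtime, it can be useful to run the reproduction scripts through a wrapper that stops at the first failing step instead of continuing with incomplete outputs. The following is a minimal sketch; the function `run_in_order` is a hypothetical helper and not part of this repository.

```bash
# Hypothetical helper (not part of this repo): run each script in order
# and stop as soon as one of them exits with a non-zero status.
run_in_order() {
    local script
    for script in "$@"; do
        echo "Running $script"
        if ! bash "$script"; then
            echo "Failed: $script" >&2
            return 1
        fi
    done
}

# Example invocation from the repository root (commented out so that
# sourcing this snippet does not start the multi-week run):
# run_in_order scripts/rewrite.sh scripts/clean_eval.sh \
#     scripts/check_overlap.sh scripts/finetune.sh \
#     code-contamination-detection/bash.sh
```

After a failure, fix the reported step and call `run_in_order` again with only the remaining scripts.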