# LARP: Learner-Agnostic Robust Data Prefiltering

The code accompanying our AISTATS 2026 submission.

## Prerequisites
- Install Anaconda. 
- Create the conda environment:<br>
> conda env create -f larp\_env.yml

- Enable the created environment:<br>
> conda activate larp\_env

- Install the following packages within the environment using pip:<br>
> pip install torch torchvision tensorboard ray\[tune\] wandb openpyxl jax jaxopt joblib

Note that the experiments for the Adult dataset and Gaussian mean estimation can be run on CPUs by installing torch and torchvision using the following command:<br>
> pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

The BAF dataset can be downloaded from https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022.

The CIFAR-10N dataset can be downloaded from http://ucsc-real.soe.ucsc.edu:1995/Download.html.

The Tiny ImageNet dataset is handled automatically by the script **tinyimagenet\_label\_reg\_partialds.py**.

## Experiments on Adult and CIFAR-10 datasets presented in Section 5.1
Call the relevant script depending on the specific setting:

- The script **adult\_fixed\_default\_partial.py** is used for conducting LARP instances for the Adult dataset with label noise.
- The script **cifar10\_label\_reg\_partialds.py** is used for conducting LARP instances for the CIFAR-10 dataset with label noise.
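As a rough illustration of what a contamination ratio means for label noise, the sketch below flips a fraction of labels to a different class chosen uniformly at random. This is only an assumption about the noise model, not necessarily what the scripts above do; the function name `inject_label_noise` and its arguments are hypothetical.

```python
import random


def inject_label_noise(labels, eps, num_classes, seed=0):
    """Return a copy of `labels` in which round(eps * n) entries are flipped.

    Assumption: symmetric label noise, i.e. each corrupted label is replaced
    by a different class drawn uniformly at random.
    """
    rng = random.Random(seed)
    n = len(labels)
    num_corrupted = round(eps * n)
    noisy = list(labels)
    # Pick the indices to corrupt without replacement.
    for i in rng.sample(range(n), num_corrupted):
        other_classes = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(other_classes)
    return noisy


if __name__ == "__main__":
    clean = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    noisy = inject_label_noise(clean, eps=0.3, num_classes=10)
    changed = sum(1 for a, b in zip(clean, noisy) if a != b)
    print(f"{changed} of {len(clean)} labels were changed")
```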


### Parameters
- *--eps* - Contamination ratio, e.g. **0.3**. 
- *--annp* - The prefiltering procedure hyperparameter stating what percentage of data is removed, e.g. **75.0**.
- *--run* - The number of the run of the experiment, e.g. **2**.
- *--force* - Boolean parameter stating whether results should be overwritten if they already exist.
- *--size\_percent* - Percentage of the dataset kept after the initial reduction, e.g. **75.0**.


### Example Command

- To run our setup on CIFAR-10 with 30% label noise, initial dataset reduction to 70%, and prefiltering such that 80% of the data passes through the filter, you can run the following command:<br>
> python cifar10\_label\_reg\_partialds.py --eps=0.3 --annp=80.0 --size\_percent=70.0
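To cover several settings and repeated runs, the command above can be generated programmatically. The following is a minimal sketch that only builds and prints the command lines; the helper `build_commands` is hypothetical and not part of this repository, and it uses only the flags listed under Parameters.

```python
import itertools
import sys


def build_commands(script, eps_values, annp_values, runs, size_percent):
    """Build one command (as an argument list) per (eps, annp, run) combination."""
    commands = []
    for eps, annp, run in itertools.product(eps_values, annp_values, runs):
        commands.append([
            sys.executable,  # the Python interpreter of the active environment
            script,
            f"--eps={eps}",
            f"--annp={annp}",
            f"--run={run}",
            f"--size_percent={size_percent}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands(
        "cifar10_label_reg_partialds.py",
        eps_values=[0.3],
        annp_values=[80.0],
        runs=[0, 1, 2],
        size_percent=70.0,
    ):
        print(" ".join(cmd))
```

Each generated list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.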


## Experiments on the BAF dataset presented in Section 5.2
The script **rebalancing\_grid\_baf.py** is used for conducting LARP instances for the BAF dataset.
Note that the script runs a grid search over alpha, beta and gamma, and returns the Matthews correlation coefficient (MCC) and the disparate impact (DI) ratio for each triplet.
This can later be used to find the combined loss for each downstream learner and find the price of LARP.
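
For readers unfamiliar with the two reported metrics, the sketch below computes them from binary predictions using their standard textbook definitions. It is not taken from **rebalancing\_grid\_baf.py**: the function names, the binary `groups` indicator (how the age column is split into groups is an assumption here), and the choice of which group is the numerator are all hypothetical.

```python
import math


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def di_ratio(y_pred, groups, unfavourable_value=1, protected_group=1):
    """Ratio of favourable-prediction rates: protected group over the rest.

    Raises ZeroDivisionError if either group is empty or the reference
    group receives no favourable predictions.
    """
    def favourable_rate(in_protected):
        preds = [p for p, g in zip(y_pred, groups)
                 if (g == protected_group) == in_protected]
        return sum(1 for p in preds if p != unfavourable_value) / len(preds)

    return favourable_rate(True) / favourable_rate(False)


if __name__ == "__main__":
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
    print(di_ratio([0, 1, 0, 0], [1, 1, 0, 0]))
```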


### Parameters
- *--csv* - Path to dataset CSV, e.g. **Base.csv**. 
- *--label-col* - Name of label column, e.g. **fraud\_bool**.
- *--age-col* - Name of sensitive age column, e.g. **customer\_age**.
- *--unfavourable-value* - Unfavourable label value, e.g. **1**.
- *--random\_seed* - Random seed, e.g. **42**.
- *--size\_percent* - Percentage of the dataset kept after the initial reduction, e.g. **75.0**.
- *--out* - Path of output file, e.g. **output.csv**.


### Example Command

- To run our setup on BAF with the dataset stored as *Base.csv*, the label column named *fraud_bool*, the sensitive column named *customer_age*, the unfavourable value being 1, initial dataset reduction to 70%, and the output file named *abg_rf_grid_results.csv*, you can run the following command:<br>
> time python rebalancing\_grid\_baf.py --csv=Base.csv --label-col=fraud\_bool --age-col=customer\_age --unfavourable-value=1 --size\_percent=70.0 --out=abg\_rf\_grid\_results.csv


## Experiments on Gaussian mean estimation presented in Supplementary material

The script **exp7\_huber2joint.py** is used for conducting the Gaussian mean estimation experiments presented in the Supplementary material.

### Parameters
The script takes three positional parameters:
- The first parameter is the contamination ratio, e.g. **0.2**.
- The second parameter is the parameter of the second Huber learner, e.g. **3.0** (the first Huber learner is fixed at delta=0.01).
- The third parameter selects which prefiltering procedure is used: **sdo**, **zscore** or **quantile**.

### Example Command
To run our setup with a 20% contamination ratio, the Huber parameter of the second learner equal to 3.0, and prefiltering using SDO, you can run the following command:<br>
> python exp7\_huber2joint.py 0.2 3.0 sdo
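
To make the setting concrete, here is a minimal one-dimensional sketch of a z-score prefilter followed by a Huber location estimate. It is not the code in **exp7\_huber2joint.py**: the function names, the `threshold` parameter, and the iteratively reweighted averaging used to compute the Huber estimate are assumptions chosen for illustration, and the SDO and quantile procedures are not shown.

```python
import statistics


def zscore_prefilter(xs, threshold=3.0):
    """Keep the points whose absolute z-score is at most `threshold`."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    if sigma == 0:
        return list(xs)
    return [x for x in xs if abs(x - mu) / sigma <= threshold]


def huber_mean(xs, delta, n_iter=100, tol=1e-9):
    """Huber location estimate via iteratively reweighted averaging.

    Points within `delta` of the current estimate get weight 1; points
    further away get weight delta / |residual|, which limits their pull.
    """
    m = statistics.median(xs)
    for _ in range(n_iter):
        weights = [1.0 if abs(x - m) <= delta else delta / abs(x - m) for x in xs]
        new_m = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m


if __name__ == "__main__":
    data = [-1.0, -0.5, 0.0, 0.5, 1.0, 100.0]  # one gross outlier
    kept = zscore_prefilter(data, threshold=2.0)
    print("kept after prefiltering:", kept)
    print("Huber estimate (delta=3.0):", huber_mean(kept, delta=3.0))
    print("Huber estimate (delta=0.01):", huber_mean(kept, delta=0.01))
```

The same prefiltered sample is passed to both learners, which mirrors the learner-agnostic idea: the filter is chosen once and must serve learners with different Huber parameters.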

## Experiments on the Adult dataset presented in Supplementary material
Call the relevant script depending on the specific setting:

- The script **adult\_shortcut\_default\_partial.py** is used for conducting LARP instances for the Adult dataset with shortcuts.
- The script **adult\_fixed\_default\_partial\_oracle.py** is used for conducting LARP instances for the Adult dataset with oracle prefiltering.
- The script **adult\_fixed\_default\_partial\_conflearn.py** is used for conducting LARP instances for the Adult dataset with prefiltering based on Confident Learning.


### Parameters
- *--eps* - Contamination ratio, e.g. **0.3**. 
- *--annp* - The prefiltering procedure hyperparameter stating what percentage of data is removed, e.g. **75.0**.
- *--run* - The number of the run of the experiment, e.g. **2**.
- *--force* - Boolean parameter stating whether results should be overwritten if they already exist.
- *--t* - Efficiency of the oracle prefiltering procedure. (Relevant to **adult\_fixed\_default\_partial\_oracle.py**).
- *--size\_percent* - Percentage of the dataset kept after the initial reduction, e.g. **75.0**.

### Example Command

- To run our setup on the Adult dataset with oracle prefiltering, a contamination ratio of 0.85, and *--annp* set to 80.0, you can run the following command:<br>
> python adult\_fixed\_default\_partial\_oracle.py --eps=0.85 --annp=80.0

## Experiments on the CIFAR-10 dataset presented in Supplementary material
Call the relevant script depending on the specific setting:

- The script **cifar10\_shortcut\_color\_patchreg\_partialds.py** is used for conducting LARP instances for the CIFAR-10 dataset with shortcuts.
- The script **cifar10\_label\_reg\_partialds\_conflearn.py** is used for conducting LARP instances for the CIFAR-10 dataset with prefiltering based on Confident Learning.
- The script **cifar10\_label\_reg\_partialds\_human.py** is used for conducting LARP instances for the CIFAR-10N dataset.
- The script **cifar10\_label\_reg\_partialds\_oracle.py** is used for conducting LARP instances for the CIFAR-10 dataset with oracle prefiltering.
- The script **cifar10\_shortcut\_color\_patchreg\_ttll.py** is used for conducting LARP instances for the CIFAR-10 dataset with shortcuts and various prefiltering procedures.
- The script **cifar10\_shortcut\_color\_pweight\_partialds.py** is used for conducting LARP instances for the CIFAR-10 dataset with shortcuts and learner set parametrized by loss reweighting.

### Parameters
- *--eps* - Contamination ratio, e.g. **0.3**. 
- *--annp* - The prefiltering procedure hyperparameter stating what percentage of data is removed, e.g. **75.0**.
- *--run* - The number of the run of the experiment, e.g. **2**.
- *--force* - Boolean parameter stating whether results should be overwritten if they already exist.
- *--prefilter\_model* - Whether the prefiltering model should be our custom CNN or a small FFNN. Acceptable values are **ann5** and **cnn5**. (Relevant to **cifar10\_shortcut\_color\_patchreg\_ttll.py**).
- *--t* - Efficiency of the oracle prefiltering procedure, e.g. **0.75**. (Relevant to **cifar10\_label\_reg\_partialds\_oracle.py**).
- *--size\_percent* - Percentage of the dataset kept after the initial reduction, e.g. **75.0**.

### Example Command

- To run our setup on CIFAR-10 with 85% shortcuts, and prefiltering using our custom CNN, you can run the following command:<br>
> python cifar10\_shortcut\_color\_patchreg\_partialds.py --eps=0.85 --annp=80.0 --size\_percent=70.0

## Experiment on the Tiny ImageNet dataset presented in Supplementary material
Call the relevant script depending on the specific setting:

- The script **tinyimagenet\_label\_reg\_partialds.py** is used for conducting LARP instances for the Tiny ImageNet dataset with label noise.

### Parameters
- *--eps* - Contamination ratio, e.g. **0.3**. 
- *--annp* - The prefiltering procedure hyperparameter stating what percentage of data is removed, e.g. **75.0**.
- *--run* - The number of the run of the experiment, e.g. **2**.
- *--force* - Boolean parameter stating whether results should be overwritten if they already exist.
- *--size\_percent* - Percentage of the dataset kept after the initial reduction, e.g. **75.0**.

### Example Command

- To run our setup on Tiny ImageNet with 30% label noise, initial dataset reduction to 70%, and prefiltering such that 80% of the data passes through the filter, you can run the following command:<br>
> python tinyimagenet\_label\_reg\_partialds.py --eps=0.3 --annp=80.0 --size\_percent=70.0



