# CustoVerse Benchmarking

Implementation of the benchmarking code used in our paper:

**Autoencoder-Based General-Purpose Representation Learning for Entity Embedding**

This repository benchmarks six Autoencoder architectures on 13 tabular datasets. The benchmarks measure reconstruction performance and compare the accuracy of downstream prediction tasks when trained on the embeddings versus the original data.

## Repo Structure

- **artifacts**: This folder has to be created manually. It holds data and artifacts that are downloaded or generated, such as datasets, tuning results, model checkpoints, and benchmarking results.
- **benchmarking**: Code for running batched benchmarks of reconstruction quality and downstream prediction performance. Its CLI is the primary way to interact with the code. For more details, see the [benchmarking README](benchmarking/README.md).
- **models**: Implementations of all six Autoencoder models, a training script, and a CLI tool for hyperparameter tuning. For more details, see the [models README](models/README.md).


## Getting Started
First, navigate to the repo. 
```
cd custo-verse-benchmarking
```
Next, install all required packages via pip or conda:
```
pip install -r requirements.txt
```
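The `artifacts` folder described under Repo Structure has to exist before data and results can be stored in it. A minimal sketch of creating it from Python follows; the helper name `ensure_artifacts_dir` is hypothetical and not part of this repository, and the code assumes it is run from the repo root. Creating the folder by hand works just as well.

```python
from pathlib import Path


def ensure_artifacts_dir(repo_root: str = ".") -> Path:
    """Create the top-level artifacts folder if it does not exist yet.

    Hypothetical helper for illustration; not part of the repository.
    """
    artifacts = Path(repo_root) / "artifacts"
    # exist_ok=True makes the call safe to repeat on an existing folder.
    artifacts.mkdir(parents=True, exist_ok=True)
    return artifacts


if __name__ == "__main__":
    print(f"Artifacts folder ready at: {ensure_artifacts_dir().resolve()}")
```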
For detailed instructions on how to run benchmarks, refer to the [benchmarking README](benchmarking/README.md). To run all benchmarks, use the following command:
```
python -m benchmarking.benchmarking
```

## Autoencoder Architectures

For a list of all Autoencoder architectures used in the benchmarks, along with additional details, refer to the [models README](models/README.md).



## Datasets

| Name | Description | Type | # Raw Features | # Processed Features |
|------|-------------|------|----------------|----------------------|
| [Abalone](https://archive.ics.uci.edu/dataset/1/abalone) | Predict the age of abalone from physical measurements | regression | 8 | 11 |
| [Adult](https://archive.ics.uci.edu/dataset/2/adult) | Predict if adult income exceeds 50k a year | classification | 14 | 107 |
| [Air Quality](https://www.kaggle.com/datasets/stealthtechnologies/preeeeeeee) | Predict PM2.5 amount in Beijing air | regression | 12 | 15 |
| [Bank Marketing](https://archive.ics.uci.edu/dataset/222/bank+marketing) | Customer reaction to a bank's marketing campaign | classification | 16 | 46 |
| [California Housing](https://www.kaggle.com/datasets/camnugent/california-housing-prices) | Housing price prediction task | regression | 10 | 14 |
| [Churn Modelling](https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling) | Churn prediction for a bank's customers | classification | 14 | 2947 |
| [Customer Retention Retail](https://www.kaggle.com/datasets/uttamp/store-data) | Marketing effects on customer behaviour | classification | 15 | 32 |
| [Parkinsons](https://archive.ics.uci.edu/dataset/189/parkinsons+telemonitoring) | Predict UPDRS scores in Parkinson's patients | regression | 12 | 15 |
| [Shoppers](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset) | Online shoppers' purchasing intention | classification | 17 | 20 |
| [Students](https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success) | Predict students' dropout and academic success | classification | 36 | 40 |
| [Support2](https://archive.ics.uci.edu/dataset/880/support2) | Predict death of hospitalized patients | classification | 42 | 72 |
| [Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) | Predict behavior to retain customers | classification | 21 | 47 |
| [Walmart](https://www.kaggle.com/datasets/yasserh/walmart-dataset) | Predict weekly store sales | regression | 6 | 6 |
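
The processed feature counts in the table are often larger than the raw counts. This README does not describe the preprocessing, so the sketch below only illustrates one common way such growth can arise, under the assumption that categorical columns are one-hot encoded. The function `one_hot_width` and the toy rows are hypothetical and are not taken from the repository or from any of the datasets above.

```python
from typing import Dict, Sequence


def one_hot_width(rows: Sequence[Dict[str, object]], categorical: Sequence[str]) -> int:
    """Count the columns that result from one-hot encoding the categorical columns.

    Illustrative assumption only: each categorical column expands into one
    column per distinct value, and every other column stays a single column.
    """
    if not rows:
        return 0
    width = 0
    for col in rows[0].keys():
        if col in categorical:
            # One indicator column per distinct category value.
            width += len({row[col] for row in rows})
        else:
            # Non-categorical columns are kept as a single column.
            width += 1
    return width


# Made-up toy rows, for illustration only.
toy_rows = [
    {"age": 39, "workclass": "Private", "sex": "Male"},
    {"age": 50, "workclass": "Self-emp", "sex": "Male"},
    {"age": 28, "workclass": "State-gov", "sex": "Female"},
]

raw = len(toy_rows[0])
processed = one_hot_width(toy_rows, ["workclass", "sex"])
print(f"{raw} raw features -> {processed} processed features")
```

Under this assumption, a dataset with few or no categorical columns would keep roughly the same width, while a column with many distinct values would add many processed features.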

## License

This project is licensed under the Apache-2.0 License.
