# Tabular Deep SMOTE
## The model
Tabular Deep SMOTE (TD-SMOTE) is an autoencoder-based model for oversampling the minority class
in imbalanced tabular classification datasets.

<img src="tdsmote_overview.png" width="900">

## Setup

1) `git clone <Anonymize>`  
2) `virtualenv venv --python=python3.8`  (3.8.17)
3) activate virtual env:  
    linux: `source ./venv/bin/activate`  
    windows: `.\venv\Scripts\activate`  
4) If an NVIDIA GPU is available, update the torch version (skip this step if there is no GPU):  
    run `nvidia-smi` to find the CUDA version  
    find the applicable torch version at [CUDA Version](https://pytorch.org/get-started/locally/)  
    install the relevant torch version, e.g. `pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu117`  
    comment out the torch entry in `requirements.txt`
5) Install dependencies:  
    linux: `pip3 install -r requirements.txt`  
    windows: `pip3 install -r requirements.txt --use-feature=2020-resolver`  
    OR: `python setup.py install`

## Dataset Preparation
### Keel Datasets
1) Download the desired datasets as "5-fcv" zipped files from https://sci2s.ugr.es/keel/imbalanced.php?order=ins#sub30.
2) Save each zipped file under `datasets/Keel/raw/<dataset_name>`.  
   For example: `datasets/Keel/raw/glass4/glass4-5-fold.zip`.
3) Run `python3 preprocess_keel.py` under the `datasets` directory.
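
Before running the preprocessing script, it can help to confirm that every archive sits in its own dataset folder. The snippet below is a minimal sketch of such a check. `find_keel_archives` is a hypothetical helper and is not part of this repository. It assumes only the `datasets/Keel/raw/<dataset_name>` layout described in step 2.

```python
from pathlib import Path


def find_keel_archives(raw_dir="datasets/Keel/raw"):
    """Map each dataset folder under raw_dir to the zip files it contains.

    Hypothetical helper: it only inspects the folder layout and does not
    open or validate the archives.
    """
    raw = Path(raw_dir)
    if not raw.is_dir():
        return {}
    return {
        folder.name: sorted(p.name for p in folder.glob("*.zip"))
        for folder in sorted(raw.iterdir())
        if folder.is_dir()
    }


if __name__ == "__main__":
    # Run from the repository root so the default relative path resolves.
    for name, archives in find_keel_archives().items():
        status = ", ".join(archives) if archives else "no zip file found"
        print(f"{name}: {status}")
```

A dataset folder reported with "no zip file found" most likely still needs its "5-fcv" archive downloaded.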

### Imblearn Datasets
Run `python3 preprocess_imblearn.py` under the `datasets` directory.

## Running
The TDSMOTE class can be imported directly from `model.py` as described below.  
Another example can be found at `tabular_deep_smote_wrapper.py` which is a wrapper that includes:  
* Display of dataset characteristics, e.g. the imbalance ratio in the train and test sets.
* HP search for pivotal parameters (using Optuna).
* 2D/3D PCA visualizations of dataset and the synthetic samples in the latent and original space.
* 2D UMAP visualization of dataset with synthetic samples in the original space.  
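
As an illustration of the first bullet, the imbalance ratio of a label set can be computed in a few lines. This is a minimal sketch that assumes the common definition (size of the largest class divided by the size of the smallest class). The wrapper's own implementation may differ, and `imbalance_ratio` is a hypothetical name.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count.

    Assumes `labels` is an iterable of hashable class labels, such as a
    plain Python list of ints.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Toy labels: 90 majority samples and 10 minority samples.
train_labels = [0] * 90 + [1] * 10
print(imbalance_ratio(train_labels))  # 9.0
```

For tensors or arrays, converting the labels with `.tolist()` first keeps the counting by value.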
  
Run the wrapper (`tabular_deep_smote_wrapper.py`):  
`python3 -m tabular_deep_smote.tabular_deep_smote_wrapper <-flag=...> <-flag=...>`

## Example
```
from models import TDSMOTE

model = TDSMOTE(dataset_name, categorical_features, ...)
train_results = model.fit(train_data, validation_data, best_checkpoint_save_path)
# generated minority samples are stored as a torch.tensor (new_minority_pt)
oversample_results = model.oversample(data, new_minority_pt, oversample_ratio)
```
Where:
1) All data objects (`train_data`, `validation_data`, `data`) are tuples of (values, labels) where each is either a torch.tensor, numpy array or pandas.DataFrame.
2) `train_results` is an instance of:
```
class TrainResults:
    def __init__(self, best_epoch, best_losses):
        self.best_epoch = best_epoch
        self.best_losses = best_losses
```
3) `oversample_results` is an instance of:
```
class OversampleResult:
    def __init__(self, x_all, y_all, x_gen, interpolation):
        self.x_all = x_all
        self.y_all = y_all
        self.x_gen = x_gen
        self.interpolation = interpolation
```
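
Item 1 above expects each data object as a `(values, labels)` tuple. The following is a minimal sketch of one way to build `train_data` and `validation_data` in that shape from NumPy arrays, using a per-class split so the minority class appears in both parts. `stratified_split`, `val_fraction` and `seed` are hypothetical names introduced for illustration and are not part of the TD-SMOTE API.

```python
import numpy as np


def stratified_split(values, labels, val_fraction=0.2, seed=0):
    """Split arrays into (values, labels) tuples for training and validation.

    Each class contributes roughly `val_fraction` of its samples (at least
    one) to the validation part.
    """
    rng = np.random.default_rng(seed)
    val_mask = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        n_val = max(1, int(round(len(cls_idx) * val_fraction)))
        val_mask[cls_idx[:n_val]] = True
    train_data = (values[~val_mask], labels[~val_mask])
    validation_data = (values[val_mask], labels[val_mask])
    return train_data, validation_data


# Toy dataset: 20 samples with 2 features, 16 majority and 4 minority labels.
values = np.arange(40, dtype=float).reshape(20, 2)
labels = np.array([0] * 16 + [1] * 4)
train_data, validation_data = stratified_split(values, labels, val_fraction=0.25)
print(train_data[0].shape, validation_data[0].shape)  # (15, 2) (5, 2)
```

The resulting tuples have the shape described in item 1 and could be passed where `train_data` and `validation_data` appear in the example.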

## Run full experiment
Run `python3 -m experiments.run_experiment` from the top directory (`td-smote`).  
The classifier type and other experiment settings can be configured in the `experiment_settings.py` file.  
Results are saved under the `experiments/results` directory.
