# Combining-Machine-Learning-Pipelines-with-Monoids-Supplement

Supplemental material for the paper submission
"Training and Cross-Validating Machine Learning Pipelines with Limited Memory".

The implementation of the algorithms described in the paper is
available as part of an open-source project under the Apache 2.0 license. The URL of the project is
not included due to double-blind review.
The directory `rasl_experiments` in this supplemental material provides the scripts for
running the experiments described in the paper.

Running the scripts requires installing the dependencies listed in `requirements.txt` by running
`pip install -r requirements.txt` in a new Python environment (we have tested with Python 3.8 and 3.9).
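
Since only Python 3.8 and 3.9 were tested, a quick interpreter check before installing can save time. The following is a minimal sketch; the helper name `is_tested_python` is hypothetical and not part of the supplemental scripts.

```python
import sys

# Versions this README reports as tested; other versions may work but are untested.
TESTED_VERSIONS = {(3, 8), (3, 9)}


def is_tested_python(version_info=sys.version_info):
    """Return True if the major.minor version is one of the tested versions."""
    return tuple(version_info[:2]) in TESTED_VERSIONS


if __name__ == "__main__":
    if not is_tested_python():
        print(
            f"Warning: Python {sys.version_info[0]}.{sys.version_info[1]} "
            "is not among the tested versions (3.8, 3.9)."
        )
```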

#### RQ1. Can batching enable fitting pipelines on larger data without Spark SQL?

For RQ1, the datasets must be downloaded explicitly; all of them are publicly available and referenced in the paper.
Then create train-test splits using [split_csv.py](rasl_experiments/split_csv.py).
Before running `split_csv.py`, set `dataset_base_dir`, `dataset_name`, and `label_column`, and, if applicable, `columns_to_drop`. The script contains the settings for the
[KDDCup99_full](https://www.openml.org/search?type=data&sort=runs&id=1110&status=active) dataset.
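
To illustrate what these four settings control, here is a minimal, standard-library-only sketch of a train-test split over a CSV file. It is not the code in `split_csv.py`: the output file names, the `test_fraction` and `seed` parameters, and the choice to load all rows into memory are assumptions made for the example.

```python
import csv
import os
import random


def split_csv(dataset_base_dir, dataset_name, label_column,
              columns_to_drop=(), test_fraction=0.2, seed=42):
    """Write <name>_train.csv and <name>_test.csv next to <name>.csv.

    File naming, test_fraction, and seed are assumptions of this sketch.
    """
    source_path = os.path.join(dataset_base_dir, f"{dataset_name}.csv")
    with open(source_path, newline="") as f:
        reader = csv.DictReader(f)
        if label_column not in reader.fieldnames:
            raise ValueError(f"label column {label_column!r} not found")
        # Keep every column that is not explicitly dropped.
        kept = [c for c in reader.fieldnames if c not in columns_to_drop]
        rows = [{c: row[c] for c in kept} for row in reader]

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    parts = {"test": rows[:n_test], "train": rows[n_test:]}

    out_paths = {}
    for part, part_rows in parts.items():
        out_paths[part] = os.path.join(dataset_base_dir, f"{dataset_name}_{part}.csv")
        with open(out_paths[part], "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=kept)
            writer.writeheader()
            writer.writerows(part_rows)
    return out_paths
```

Because the sketch reads the whole file into memory, it is only suitable for trying out the settings on a small sample, not for the full-size datasets.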

Once the splits are ready, the code to run the experiment is [run_large_datasets.py](rasl_experiments/run_large_datasets.py).
This script expects the following command line arguments:
* `dataset: string`, the dataset to use for the experiment. Dataset names are defined in the list `datasets` in [config.py](rasl_experiments/config.py). Also set `large_datasets_dir` in [config.py](rasl_experiments/config.py) to point to the datasets' home directory.
* `expt_type: int`, the experiment setting to use: pass 0 for the sklearn setting and 1 for rasl batching.
* `batch_size: int`, the batch size, i.e., the number of rows in a batch.
* `process_memory_limit: int`, the process memory limit in bytes. This restriction simulates a small-memory setting. The values used per dataset are documented in the results section of the paper.
* `max_resident: int`, the memory limit in bytes that the task graph assumes for spilling. For the experiments, we used about one third of `process_memory_limit`.
* `num_runs: int`, the number of runs to perform.

For example, it can be called as follows:
```
python run_large_datasets.py kddcup99full 0 10000 10000000000 4000000000 1
```
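
To make the mapping between the positional values in the example and the six arguments explicit, the sketch below parses the same command line with `argparse`. Whether `run_large_datasets.py` itself uses `argparse` is an assumption; only the argument names, types, and order are taken from the list above.

```python
import argparse


def build_parser():
    """Positional arguments in the order listed in this README."""
    parser = argparse.ArgumentParser(
        description="Sketch of the run_large_datasets.py command line"
    )
    parser.add_argument("dataset", type=str, help="dataset name, e.g. kddcup99full")
    parser.add_argument("expt_type", type=int, choices=[0, 1],
                        help="0 = sklearn, 1 = rasl batching")
    parser.add_argument("batch_size", type=int, help="number of rows in a batch")
    parser.add_argument("process_memory_limit", type=int,
                        help="process memory limit in bytes")
    parser.add_argument("max_resident", type=int,
                        help="memory limit in bytes assumed for spilling")
    parser.add_argument("num_runs", type=int, help="number of runs")
    return parser


# Parse the values from the example invocation.
args = build_parser().parse_args(
    ["kddcup99full", "0", "10000", "10000000000", "4000000000", "1"]
)
# Ratio of the spilling limit to the process limit for this invocation.
print(args.dataset, args.expt_type, args.max_resident / args.process_memory_limit)
```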


#### RQ2. When should you use the pandas backend and when the Spark SQL backend?

See [compare_spark_and_pandas_scaled.py](rasl_experiments/compare_spark_and_pandas_scaled.py).
Note that this experiment requires Java for Spark, and the environment
variable `JAVA_HOME` must be set. On macOS, this can be done as follows:
```
export JAVA_HOME=`/usr/libexec/java_home`
```
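
A small preflight check can confirm that a Java executable is reachable before starting the experiment. This is a sketch for Unix-like systems; the helper `find_java` is hypothetical, and falling back to the `PATH` when `JAVA_HOME` is unset is an assumption of the example rather than a documented requirement of the script.

```python
import os
import shutil


def find_java(environ=os.environ):
    """Return the path of a usable java executable, or None.

    Checks $JAVA_HOME/bin/java first, then falls back to the PATH.
    """
    java_home = environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("java", path=environ.get("PATH"))


if __name__ == "__main__":
    java = find_java()
    print(f"Using java at {java}" if java
          else "No java found; set JAVA_HOME before running.")
```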

The experiment can then be run as follows:
```
python compare_spark_and_pandas_scaled.py
```
This experiment is run for 5 different seeds, which are initialized in the code as:
```
random_seeds = [0,42,90,33,56]
```
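
For readers unfamiliar with the multi-seed pattern, the sketch below shows one way to repeat a run per seed and summarize the results. The function `run_once` is a hypothetical placeholder that returns a pseudo-random number; it does not reproduce the measurements taken by the actual script.

```python
import random
import statistics

random_seeds = [0, 42, 90, 33, 56]


def run_once(seed):
    """Hypothetical stand-in for one experiment run; returns a fake metric."""
    rng = random.Random(seed)  # private generator, so runs do not interfere
    return rng.random()


# One result per seed; the same seed always yields the same result.
results = {seed: run_once(seed) for seed in random_seeds}
values = list(results.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)
print(f"mean={mean:.4f} stdev={stdev:.4f} over {len(values)} seeds")
```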

#### RQ3. Do the pandas and Spark SQL backends yield identical results to sklearn?

The relevant tests are in the open-source project repository under
`test/test_relational_sklearn.py` and `test/test_relational_from_sklearn_manual.py`.

#### RQ4. Does batched execution yield identical results as non-batched?

See [compare_batched_and_non_batched.py](rasl_experiments/compare_batched_and_non_batched.py).

It can be called as follows:
```
python compare_batched_and_non_batched.py
```
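
As intuition for why batched and non-batched execution can agree, the toy sketch below summarizes each batch of a numeric column and merges the summaries with an associative `combine` operation, in the spirit of the monoids named in the title. The `Stats` class is hypothetical and is not the project's implementation; an identity element for empty batches is omitted for brevity. The sample values are integer-valued, so the sums are exact; with arbitrary floats, the totals may differ by rounding.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    """Summary of one numeric column: count, sum, min, and max."""
    count: int
    total: float
    lo: float
    hi: float

    @staticmethod
    def of(batch):
        return Stats(len(batch), math.fsum(batch), min(batch), max(batch))

    def combine(self, other):
        # Associative up to floating-point rounding in the running sum.
        return Stats(self.count + other.count, self.total + other.total,
                     min(self.lo, other.lo), max(self.hi, other.hi))


def batches(values, batch_size):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


data = [float(i % 17) for i in range(1000)]
non_batched = Stats.of(data)
batched = reduce(Stats.combine, (Stats.of(b) for b in batches(data, 64)))
print(batched.count == non_batched.count,
      math.isclose(batched.total, non_batched.total))
```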
This experiment is run for 5 different seeds, which are initialized in the code as:
```
random_seeds = [0,42,90,33,56]
```

#### RQ5. How much accuracy does partial-transform training lose?

See [partialtfm.ipynb](rasl_experiments/partialtfm.ipynb).

#### RQ6. How effective is out-of-fold cross-validation at picking models?

See [crossval.ipynb](rasl_experiments/crossval.ipynb).
