# An Efficient Protocol for Distributed Column Subset Selection in the Entrywise $\ell_p$ Norm

## Requirements

The dependencies are specified in requirements.txt.

To install requirements, follow these steps:
1. Run "sudo apt-get update"
2. Clone this repository.
3. Install pip3 by running "sudo apt-get install python3-pip"
4. From the root of the cloned repository, run "pip3 install -r requirements.txt"

Beyond this, you may have to install MOSEK (a solver used by CVXPY).
It is needed to perform $\ell_1$ regression when evaluating the error of
our protocol. MOSEK can be downloaded at https://www.mosek.com/downloads/
and offers free academic licenses.

## Running Our Experiments (Obtaining Data)

### Synthetic Data
To run the synthetic data experiments included in the main paper, run

> python3 dist_greedy_baseline.py --main_paper 1 --test_rank k --num_trials t

This generates our synthetic dataset with the parameter k (skipped if the
dataset already exists) and runs the experiments for t trials.
The results will be saved to the folder named Synthetic_Checkpoints_rank_k
in JSON format.
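
Once a run has finished, the saved results can be inspected from Python. The following is a minimal sketch, not part of this repository: the helper name `load_json_results` is hypothetical, the folder name assumes k = 10, and nothing is assumed about the structure of the JSON files beyond their `.json` extension.

```python
import json
from pathlib import Path


def load_json_results(directory):
    """Return {file name: parsed JSON} for every .json file in `directory`."""
    results = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as f:
            results[path.name] = json.load(f)
    return results


if __name__ == "__main__":
    # Assumed folder name for k = 10; adjust to the rank you actually ran.
    folder = "Synthetic_Checkpoints_rank_10"
    for name, content in load_json_results(folder).items():
        # Print only the top-level type, since the schema is not assumed here.
        print(name, type(content).__name__)
```

The same helper should work on any of the result folders mentioned in this README once their contents are in JSON format.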

### bcsstk13 - All Settings
Before running these experiments, bcsstk13.mtx should be downloaded
from https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc1/bcsstk13.html
and placed in the same directory as the script.

To run the bcsstk13 experiments included in the main paper and in
section G of the supplementary materials, run

> python3 bcsstk13_experiments.py --test_rank k --num_trials t

in order to run them for a target rank k and for t trials. The results
will be saved to the folder Checkpoints_bcsstk13s_rank_k in .pickle format.
These can be quickly converted to .json format as specified below.

### isolet transpose - All Settings
To run the isolet_transpose experiments included in the main paper and in
section G of the supplementary materials, run

> python3 isolet_experiments_col_dim_large.py --test_rank k --num_trials t

in order to run them for a target rank k and for t trials. The results
will be saved to the folder Checkpoints_bcsstk13s_rank_k in .pickle format.
These can be quickly converted to .json format as specified below. Before
running this, make sure isolet1+2+3+4.csv is in the same directory. This
can be downloaded from https://archive.ics.uci.edu/ml/datasets/ISOLET.

### caltech101 Images - All Settings
To run the caltech101 experiments included in the main paper and in
section G of the supplementary materials, run

> python3 caltech101_experiments.py

The results will be saved to the folder Caltech101_Checkpoints. 
Similarly, these can be quickly converted to .json format as specified below.
Before running this, make sure to have the Caltech 101 images in the
same directory. These can be downloaded from http://www.vision.caltech.edu/Image_Datasets/Caltech101/.

### Section F of Appendix
Before running the script, download the SECOM and gastro_lesions datasets
using the links given in Section F. Create a folder titled "Additional Datasets".
Within that folder, place secom.data in the secom folder, and for gastro_lesions,
place data.txt in a new folder titled "gastroenterology_dataset".
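
The layout described above can be checked before launching the experiments. This is a minimal sketch, not part of the repository: the helper `missing_files` is hypothetical, and it assumes that "Additional Datasets" sits in the directory the script is run from, which the instructions above do not state explicitly.

```python
from pathlib import Path

# Relative paths taken from the instructions above.
EXPECTED_FILES = [
    Path("Additional Datasets") / "secom" / "secom.data",
    Path("Additional Datasets") / "gastroenterology_dataset" / "data.txt",
]


def missing_files(root="."):
    """Return the expected dataset files that are not present under `root`."""
    return [p for p in EXPECTED_FILES if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        for p in missing:
            print(f"missing: {p}")
    else:
        print("all Section F dataset files found")
```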

To obtain the results for the comparison between our protocol and the
distributed protocol of https://arxiv.org/abs/1605.08795, run the
following command:

> python3 dist_greedy_baseline.py --main_paper 0 --dataset_name ds --test_rank k --num_trials t

where k and t are the target rank and number of trials as before, and ds
is either gastro_lesions or secom. The results will be saved in JSON
format in the checkpoints_ds_distributed_greedy_l1_protocol_comparison
folder.

### Converting Pickled Files to JSON Format

Many of our results will be saved as pickled files, within directories
whose names are given above. To convert them to JSON format, run

> python3 convert_pickle_to_json.py --directory_name folder

which converts all the files in the directory whose name is "folder"
from .pickle files to .json files.
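
For readers curious what such a conversion involves, here is a minimal sketch of the general technique. It is not the code in convert_pickle_to_json.py: the function names are hypothetical, and it assumes the pickled objects are built from plain Python containers, possibly holding array-like values that expose a `.tolist()` method (as NumPy arrays do).

```python
import json
import pickle
from pathlib import Path


def to_jsonable(obj):
    """Fallback for json.dump: convert array-like objects via .tolist()."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def convert_directory(directory):
    """Write a .json file next to every .pickle file in `directory`."""
    written = []
    for src in sorted(Path(directory).glob("*.pickle")):
        with src.open("rb") as f:
            data = pickle.load(f)  # only unpickle files you trust
        dst = src.with_suffix(".json")
        with dst.open("w") as f:
            json.dump(data, f, default=to_jsonable)
        written.append(dst)
    return written
```

Calling `convert_directory("folder")` leaves the original .pickle files in place and returns the paths of the .json files it wrote.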

## Plotting the Results

### Synthetic Data Experiments
To plot the synthetic data experiments included in the main paper, run

> python3 dist_greedy_baseline.py --main_paper 2

This assumes that the synthetic data experiments have already been run
for ranks 10, 20, and 30.
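
If those three runs are not yet available, they can be launched in sequence. The sketch below is one possible way to do so and is not part of the repository: the helper names are hypothetical, it uses only the flags documented in this README, and the number of trials (5) is an arbitrary placeholder, not a value prescribed by the paper.

```python
import subprocess


def synthetic_commands(ranks=(10, 20, 30), num_trials=5):
    """Build one command line per rank, using the flags documented above."""
    return [
        ["python3", "dist_greedy_baseline.py",
         "--main_paper", "1",
         "--test_rank", str(k),
         "--num_trials", str(num_trials)]
        for k in ranks
    ]


def run_all(dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in synthetic_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing run


if __name__ == "__main__":
    run_all(dry_run=True)  # switch to False to actually launch the runs
```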

### Section F of Appendix
To plot the results for the comparison between our protocol and the
distributed protocol of https://arxiv.org/abs/1605.08795, run the
following command:

> python3 dist_greedy_baseline.py --main_paper 3

The plots will be saved to the current working directory.

### Section G of Appendix
To make the plots in our main paper and appendix, simply run

> python3 experiment_stats.py 

> python3 plot.py 

in this order. The plots will be saved to the folder Plot. 