# Optimal tree ensembles with column generation

This project contains code for learning optimal tree ensembles using column generation.

This codebase is intended purely to reproduce our results. For a user-friendly alternative, we recommend installing the Python library, which can be found in the other folder supplied with this anonymized code.

## Environment

The code is written in Python 3.12. We use scikit-learn to model ML architectures and Gurobi to solve the LPs. Further dependencies are listed in `requirements.txt`. Note that you need a Gurobi license and must download and install the Gurobi solver. We tested the project on Windows 11 and on a high-performance Linux cluster.
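Because a missing or misconfigured Gurobi license is a common reason for the scripts to fail at startup, it can help to check the solver before launching experiments. The sketch below is not part of the repository; the helper name `gurobi_is_usable` is hypothetical. It only verifies that `gurobipy` can be imported, that some license is found, and that a trivial LP can be solved.

```python
def gurobi_is_usable() -> bool:
    """Return True if gurobipy imports, a license is found, and a tiny LP solves."""
    try:
        import gurobipy as gp
    except ImportError:
        # gurobipy is not installed in the active environment.
        return False

    try:
        # Creating an environment is the step that checks the license.
        with gp.Env() as env, gp.Model(env=env) as model:
            x = model.addVar(lb=0.0, ub=1.0, name="x")
            model.setObjective(x, gp.GRB.MAXIMIZE)
            model.optimize()
            return model.Status == gp.GRB.OPTIMAL
    except gp.GurobiError:
        # Raised, for example, when no valid license is available.
        return False


if __name__ == "__main__":
    print("Gurobi usable:", gurobi_is_usable())
```

If the check returns `False`, the license setup or the `gurobipy` installation is the first thing to inspect.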


## Folder Structure
The repository contains the following folders:

- **`data/`**: Contains the data sets.
- **`src/`**: Contains the main code.
  - **`solvers/`**: Contains different solvers for the master problem
  - **`utils/`**: Contains data handling and parsers
- **`hpc/`**: Contains files used to launch multiple parallel experiments on a high-performance computing (HPC) cluster using HyperQueue

At the top level, `main.py` implements the overall policy training and evaluation loop. Three scripts can be run:
1) `column_generation.py`: the main column generation procedure
2) `benchmarks.py`: the script that runs AdaBoost, XGBoost, and LightGBM
3) `single_shot.py`: the script where we fit an AdaBoost ensemble and then reweight it using a column generation method

All controls of the scripts are in the various parsers, see the `src/utils/parsers` folder. 
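To give an idea of what reweighting a fitted AdaBoost ensemble involves, here is a minimal sketch, not the code in `single_shot.py`. It assumes binary labels in {-1, +1}, an LPBoost-style soft-margin LP as the master problem, and SciPy's `linprog` in place of Gurobi so that it runs without a license. The function name `reweight_ensemble`, the parameter `nu`, and the synthetic data are illustrative assumptions; the formulation used in the repository may differ.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier


def reweight_ensemble(H, y, nu=0.1):
    """Solve a soft-margin LP for new weights over fixed base learners.

    H[i, j] is the {-1, +1} prediction of learner j on sample i.
    Variables are ordered as [w (T weights), xi (n slacks), rho (margin)].
    """
    n, T = H.shape
    C = 1.0 / (n * nu)  # slack penalty; nu in (0, 1) keeps the LP bounded
    # Minimise -rho + C * sum(xi), i.e. maximise the soft margin.
    cost = np.concatenate([np.zeros(T), C * np.ones(n), [-1.0]])
    # Margin constraints: rho - y_i * (H_i . w) - xi_i <= 0 for every sample.
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(n), np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The weights form a convex combination: sum(w) = 1.
    A_eq = np.concatenate([np.ones(T), np.zeros(n), [0.0]])[None, :]
    b_eq = [1.0]
    bounds = [(0, None)] * (T + n) + [(None, None)]  # rho is free
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:T]


# Synthetic data, used only for illustration.
X, y01 = make_classification(n_samples=200, n_features=10, random_state=0)
y = 2 * y01 - 1  # map {0, 1} labels to {-1, +1}

ada = AdaBoostClassifier(n_estimators=20, random_state=0).fit(X, y)
H = np.column_stack([est.predict(X) for est in ada.estimators_])

weights = reweight_ensemble(H, y)
reweighted_pred = np.sign(H @ weights)
print("AdaBoost train accuracy:  ", ada.score(X, y))
print("Reweighted train accuracy:", np.mean(reweighted_pred == y))
```

In a full column generation loop, the dual values of the margin constraints would be used to price new trees into the ensemble; this sketch stops after a single solve over the existing learners.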

## To make the code work

 * Create a local Python environment by executing the following commands, in order, in the root folder
	* `python3 -m venv venv`
	* `source venv/bin/activate`
	* `python -m pip install -r requirements.txt`
	* `deactivate`

Next, run `main.py` or, to reproduce our results, use the files in the `hpc` folder to run parallel experiments.
Results are stored in a `json` file in a `Results` folder (created automatically).
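For inspecting results after a run, a small loader along the following lines may be convenient. It is only a sketch: the helper name `load_results` is hypothetical, and it assumes nothing about the file names or the structure of the stored JSON beyond the files sitting directly in the `Results` folder.

```python
import json
from pathlib import Path


def load_results(results_dir="Results"):
    """Load every .json file in results_dir into a dict keyed by file name."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.name] = json.load(handle)
    return results


if __name__ == "__main__":
    for name, content in load_results().items():
        print(name, type(content).__name__)
```

If the folder does not exist yet, the function returns an empty dictionary.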

## Note on Blossom

For the experiments using optimal trees, we used the Blossom library. On Linux, the library can be installed by cloning the repository and following the manual at: https://gitlab.laas.fr/ehebrard/blossom/-/tree/gcforest?ref_type=heads

To adapt Blossom and rebuild it on Windows, we built the library using MSYS2 with MinGW, which we also used to build SWIG.