# Multi-objective Bayesian optimization with heuristic objectives

This repository contains the code implementing MANATEE used for the analyses in the associated paper. 

MANATEE (Multi-objective bAyesiaN optimizAtion wiTh hEuristic objEctives) is a multi-objective Bayesian optimization method that automatically up- or downweights heuristic objectives based on the properties of their posterior functional form. These properties are specified as desirable or undesirable *behaviours*, which reflect the user's expectations of what a useful heuristic objective should look like. We propose that MANATEE be used for parameter optimization in biomedical and molecular data analysis pipelines, where the objectives being optimized correspond to heuristic measures of a pipeline's success. We used MANATEE to optimize the cofactor normalization parameter for the analysis of imaging mass cytometry (IMC) data and the proportion of highly variable genes for the analysis of single-cell RNA-sequencing (scRNA-seq) data.

# Installation

## Part 1

To set up a virtual environment, we provide a `Pipfile` in the root directory that specifies all required dependencies.
A `Pipfile` tracks the dependencies of a project and is used by the Python package manager `pipenv`.
To install the dependencies from the `Pipfile`, a user will need Python 3.8, `pip`, and `pipenv` installed on their system.

In our paper, we perform comparisons with an existing method, USeMO (Belakaria et al., 2020). USeMO is not available as a package;
its Python implementation can be accessed on GitHub at https://github.com/belakaria/USeMO. To execute their code, we forked
their repository and added the files `setup.py` and `init.py` so that their code can be installed as a package with `pip` in our virtual environment.
You can find our public fork of USeMO here: https://github.com/alinaselega/USeMO
To install USeMO in a virtual environment, a user will need access to a GitHub account.

Follow these steps:

1. Provided Python 3.8 and `pip` are available, install `pipenv` with: `pip install pipenv`.

2. Navigate to the root directory where `Pipfile` is located and run `pipenv install`. This will create a virtual environment and install all dependencies.
If you want to create a new directory with your preferred name, move `Pipfile` there, navigate there, and run `pipenv install`.

(Please note that there should be no `Pipfile.lock` in this directory. We have removed `Pipfile.lock` from the root directory in the updated supplementary materials zip file
as part of the rebuttal. `Pipfile.lock` contains information about the specific build of each library, which is platform-dependent and specific to the authors'
environment. We originally included it for information purposes but removed it in the rebuttal to avoid confusion. A new `Pipfile.lock` will be created in the reviewer's
directory upon creating the virtual environment.)

3. To install USeMO, first clone our fork of USeMO by running `git clone git@github.com:alinaselega/USeMO.git` or using your preferred method.
Make sure you have navigated out of the directory containing `Pipfile` before cloning.

4. Activate your virtual environment by navigating to the directory containing `Pipfile` and running `pipenv shell`.

5. Navigate to the cloned repository of USeMO and run `pip install -e .`. This will install the code implementing USeMO
under the name `usemo` in the virtual environment.

This concludes the installation process for the dependencies required to execute our code. 
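
As an optional sanity check after the steps above, the following minimal sketch tests whether a package can be found by the interpreter of the active virtual environment. The helper name `check_importable` is hypothetical and not part of this repository; the only package name taken from the text is `usemo`.

```python
import importlib.util
from typing import Dict, Iterable


def check_importable(module_names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be located for import."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    # `usemo` is the package name from step 5; add other dependencies
    # listed in `Pipfile` to this list as needed.
    for name, found in check_importable(["usemo"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run the script inside the environment started with `pipenv shell`; a `MISSING` entry suggests that the corresponding installation step did not complete.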

## Part 2

Two more packages are required to run the notebooks that we provided to reproduce the figures 
and tables in the paper:

- `jupyter notebook` is required to run the notebooks,
- `OApackage` is required to compute the Pareto optimal points displayed in Figures 3, 5, and 8.

To install these packages:

1. Install `jupyter notebook` by activating your virtual environment (run `pipenv shell` in the directory containing `Pipfile`)
and running `pipenv install notebook`.

2. `OApackage` requires the library `SWIG`. On Debian-based systems, `SWIG` can be installed by running `apt-get install swig3.0`.
See this page for further information: https://open-box.readthedocs.io/en/latest/installation/install_swig.html

Once `SWIG` is available, activate your virtual environment (run `pipenv shell` while in the directory) and install 
`OApackage` with `pip install oapackage==2.3.8`.

# Quickstart 

MANATEE can be run on the command line by executing the script `mobo_experiment.py` with corresponding arguments. 
The code supports three experiments discussed in the paper (toy, IMC, scRNA-seq) by implementing the corresponding pipelines required for new acquisitions. 
MANATEE can be executed by specifying the experiment and the parameter optimization bounds, specifically as follows:

- activate your virtual environment (navigate to the directory and run `pipenv shell`)
- navigate to `automl2023-code` (or copy the directory `automl2023-code` to the directory of your virtual environment if you made a new directory)
- run `python mobo_experiment.py --experiment toy --x_min 0 --x_max 1 --logging nolog`

This will run MANATEE-SA on toy data for the default number of steps (10). The code will print the output of the method at each iteration,
including the loss of a Gaussian process fit, the values of the behaviours (explainability, inter-objective agreement, max not at boundary),
the inferred inclusion probabilities of the objectives, and the next acquisition.
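
For scripted runs, the same command can be assembled in Python. The sketch below is illustrative only: the helper `build_command` is hypothetical, and it uses only the flags that appear in this README (`--experiment`, `--x_min`, `--x_max`, `--logging`, `--optimise_iter`).

```python
import sys
from typing import List, Optional


def build_command(
    experiment: str,
    x_min: float,
    x_max: float,
    logging: str = "nolog",
    optimise_iter: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for one run of mobo_experiment.py."""
    cmd = [
        sys.executable, "mobo_experiment.py",
        "--experiment", experiment,
        "--x_min", str(x_min),
        "--x_max", str(x_max),
        "--logging", logging,
    ]
    # Only pass --optimise_iter when the caller overrides the default.
    if optimise_iter is not None:
        cmd += ["--optimise_iter", str(optimise_iter)]
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(build_command("toy", 0, 1)))
```

To execute the run, pass the returned list to `subprocess.run(cmd, check=True)` from the directory containing `mobo_experiment.py`, with the virtual environment active.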

To run the scRNA-seq highly variable gene selection experiment, run: 

`python mobo_experiment.py --experiment citeseq --x_min 0.01 --x_max 0.5 --logging nolog --optimise_iter 2`

Note that this experiment uses real data (the files are provided in `automl2023-code`) so it will take longer, which is why 
specifying a smaller number of optimization steps is advised for testing purposes. 
Please note that we were not able to provide the data files for the cofactor selection experiment (`--experiment imc`) due 
to the size limit on the supplementary materials zip file (50 MB). 

Also please note that executing the code will create directories `cache` and `cache_EXPERIMENT` in your home directory, which are used for 
faster data loading between runs. These directories can be safely removed.

## Logging and tracking with Weights & Biases

The code supports [Weights & Biases](https://wandb.ai) integration (with `--logging` set to `wandb` by default), which tracks acquisitions,
meta-objectives (ARI, NMI), and objective inclusion probabilities and behaviours. These can also be accessed from the dictionary returned by
the function `main` in `mobo_experiment.py`, along with the acquisition function values at each step and the final acquired datasets.

When `--logging wandb` is set (which it is by default), the code will write the dictionary to a file named `run_name + run_id.pt` in a new directory
called `project + sweep_id`, where `project` denotes the active Weights & Biases project, `sweep_id` denotes the ID of the sweep 
(see README in `automl2023-materials/yaml files/`), and `run_name` and `run_id` denote the name and ID of the current run tracked 
with Weights & Biases. To limit the outputs of the code, no outputs except the print statements are generated when executing the command with `--logging nolog`.
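
The naming scheme above can be sketched as follows. This is a literal reading of the description, not code taken from the repository: the helper `results_path` and its `base_dir` parameter are hypothetical, and it assumes the names are joined by plain concatenation with no separator.

```python
from pathlib import Path


def results_path(
    project: str,
    sweep_id: str,
    run_name: str,
    run_id: str,
    base_dir: str = ".",
) -> Path:
    """Compose the results file path from the naming scheme described above."""
    # Assumption: plain string concatenation without a separator, and a
    # directory created relative to `base_dir` (not confirmed by the source).
    return Path(base_dir) / (project + sweep_id) / (run_name + run_id + ".pt")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(results_path("myproject", "sweep1", "run", "abc"))
```

The `.pt` extension suggests that the file was saved with PyTorch and could be read back with `torch.load`, but this is an assumption; check `mobo_experiment.py` for the exact saving call.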

Thus, when running the code locally without tracking, always include the `--logging nolog` flag;
otherwise, the code will try to set up tracking with Weights & Biases.
You can also adjust the number of iterations: for example, `--optimise_iter 2` limits the run, and therefore the printed output,
to two iterations.

## Additional arguments

There are additional optional arguments described in the help message which can be accessed with `python mobo_experiment.py -h`. 

By default, `mobo_experiment.py` will execute only MANATEE (specifically, the MANATEE-SA version), but the code also supports other methods considered in the paper 
(MANATEE-AS, RS and RA baselines, qNEHVI with approximate hypervolume computation, qNParEGO, USeMO). 

These methods can be executed by adding the `--strategy` argument to the command above, specifically:
- RS: `--strategy "random prob"`
- RA: `--strategy "random loc"`
- qNEHVI: `--strategy botorch`
- qNParEGO: `--strategy qparego`
- USeMO: `--strategy usemo`

MANATEE-AS uses the default `strategy` (so `--strategy` does not need to be specified) and can be executed by specifying `--ucb_scal exhaustive`,
which denotes exhaustive computation of the acquisition function (recommended) or `--ucb_scal mc` for sampling-based computation (which will be slower).
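
When comparing several methods in one script, the flags listed above can be kept in a lookup table. The sketch below is illustrative: the dictionary keys are labels chosen here (they are not names used by the code), `command_for` is a hypothetical helper, and the base arguments are those of the toy quickstart command.

```python
import sys
from typing import Dict, List

# Extra command-line flags per method, as listed in this section.
# Keys are labels for this sketch only.
METHOD_FLAGS: Dict[str, List[str]] = {
    "MANATEE-SA": [],  # default, no extra flags
    "MANATEE-AS": ["--ucb_scal", "exhaustive"],
    "RS": ["--strategy", "random prob"],
    "RA": ["--strategy", "random loc"],
    "qNEHVI": ["--strategy", "botorch"],
    "qNParEGO": ["--strategy", "qparego"],
    "USeMO": ["--strategy", "usemo"],
}

# Arguments of the toy quickstart command.
BASE_ARGS: List[str] = [
    "mobo_experiment.py", "--experiment", "toy",
    "--x_min", "0", "--x_max", "1", "--logging", "nolog",
]


def command_for(method: str) -> List[str]:
    """Return the full argument list for running one method on toy data."""
    return [sys.executable] + BASE_ARGS + METHOD_FLAGS[method]


if __name__ == "__main__":
    for method in METHOD_FLAGS:
        print(method, "->", command_for(method)[1:])
```

Passing the arguments as a list (for example to `subprocess.run`) means that values containing a space, such as `random prob`, need no shell quoting.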

Please note that qNEHVI and qNParEGO sometimes throw a runtime error if they are unable to perform optimization on the given training set.
If this happens, try running the command again with a different random seed (add `--seed X` to the command) or with a larger training set (e.g. add `--num_train_pts 10`).

Also please note that the warnings printed when executing qNEHVI, qNParEGO, or USeMO (existing methods we compare to) are generated by the implementations of
those methods and are not a part of our code.

## Seed reproducibility

To verify that the code generates the same output for the same seed, run 
`python mobo_experiment.py --experiment toy --x_min 0 --x_max 1 --logging nolog --optimise_iter 1 --seed 5` (or any other seed value) twice. 
All printed outputs will be the same.
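
The check above can also be automated. The following minimal sketch runs a command twice and compares the captured standard output; the helper `outputs_match` is hypothetical, and it compares only stdout, so warnings written to stderr are ignored.

```python
import subprocess
import sys
from typing import List


def outputs_match(cmd: List[str]) -> bool:
    """Run `cmd` twice and report whether both runs printed identical stdout."""
    first = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    second = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return first == second


# The command from the text above, as an argument list.
# It must be run from the directory containing mobo_experiment.py.
SEED_CHECK_CMD: List[str] = [
    sys.executable, "mobo_experiment.py",
    "--experiment", "toy", "--x_min", "0", "--x_max", "1",
    "--logging", "nolog", "--optimise_iter", "1", "--seed", "5",
]
```

With the virtual environment active, `outputs_match(SEED_CHECK_CMD)` is expected to return `True` if the run is reproducible for that seed.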
