# What Actually Matters for Materials Discovery: Pitfalls and Recommendations in Bayesian Optimization

This repository contains the code accompanying the paper "What Actually Matters for Materials Discovery: Pitfalls and Recommendations in Bayesian Optimization."

## Repository Structure

- `cache_features.py` – Script for caching fixed features.
- `compute_rogi.py` – Computes the ROGI (Roughness Index) for each dataset and feature type.
- `rogi_analysis.ipynb` – Jupyter notebook for ROGI analysis and visualization.
- `plot_param_sensitivity.ipynb` – Generates plots for parameter sensitivity experiments.
- `plot_results_feature_finetuning.ipynb` – Generates plots for feature fine-tuning experiments.
- `plot_visual_abstract.ipynb` – Generates visual abstract figures.
- `plot_results_features.ipynb` – Generates plots comparing feature types.
- `run_fixed_features.py` – Runs BO experiments with fixed features.
- `run_feature_finetuning.py` – Runs BO experiments with feature fine-tuning.
- `run_bayesian_finetuning.py` – Runs BO experiments with Bayesian fine-tuning.

## Installation

Install the required dependencies:

```sh
pip install -r requirements.txt
```

## Datasets

Download the datasets into `data/` from the following repositories:

- [`redox-mer`, `solvation`, `kinase`, `laser`, `pce`, `photoswitch`](https://github.com/wiseodd/llm-bayesopt-exps/tree/main)
- [`ampc`, `d4`](https://github.com/wiseodd/bo-async-feedback/tree/master/data)

## Caching Features

To cache features, run:

```sh
python cache_features.py --feature_type fingerprints --problem solvation
```

Available options:

- **Feature Types:** `fingerprints`, `molformer`, `t5-base-chem`, `mordred`, `degree_of_conjugation`, `force_field`, `dft`
- **Feature Reduction:** `default`, `average`
- **Prompt Type:** `single-number`, `just-smiles`, `naive`, `completion`
- **Problems:** `redox-mer`, `solvation`, `kinase`, `laser`, `pce`, `photoswitch`, `ampc`, `d4`

For a full list of options, run:

```sh
python cache_features.py --help
```
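
To cache several feature types for one problem in a single pass, a loop over `--feature_type` values is one option. The following is a minimal sketch, not part of the repository: the `cache_all` function, the `RUN` dry-run variable, and the choice of three feature types are illustrative assumptions. Only the `--feature_type` and `--problem` flags are taken from the command above, and some feature types may need additional options (check `--help`).

```sh
# Dry run by default: each command is printed instead of executed.
# Set RUN to the empty string (RUN= sh script.sh) to actually run the commands.
RUN="${RUN-echo}"

# Hypothetical helper: cache a few feature types for the given problem.
cache_all() {
    problem="$1"
    for feature_type in fingerprints mordred molformer; do
        $RUN python cache_features.py --feature_type "$feature_type" --problem "$problem"
    done
}

cache_all solvation
```

Printing the commands first makes it easy to check the sweep before starting potentially long feature-extraction jobs.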

## Running Bayesian Optimization Experiments

### Fixed Features

Run BO experiments using predefined feature sets:

```sh
python run_fixed_features.py --problem solvation --method gp --feature_type fingerprints --kernel tanimoto --acqf ts --n_init_data 10 --exp_len 200 --randseed 0
```
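
BO results are usually reported over several random seeds, so the command above can be repeated with different `--randseed` values. The sketch below assumes five seeds (0 to 4). The seed range, the `sweep_seeds` function, and the `RUN` dry-run variable are illustrative assumptions and not part of the repository. All flags are copied from the command above.

```sh
# Dry run by default: each command is printed instead of executed.
# Set RUN to the empty string (RUN= sh script.sh) to actually run the experiments.
RUN="${RUN-echo}"

# Hypothetical helper: repeat the fixed-feature experiment over seeds 0..4.
sweep_seeds() {
    for seed in 0 1 2 3 4; do
        $RUN python run_fixed_features.py --problem solvation --method gp \
            --feature_type fingerprints --kernel tanimoto --acqf ts \
            --n_init_data 10 --exp_len 200 --randseed "$seed"
    done
}

sweep_seeds
```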

### Feature Fine-Tuning

Run BO with fine-tuned molecular features:

```sh
python run_feature_finetuning.py --problem redox-mer --method laplace --foundation_model molformer --kernel matern_2.5 --acqf ts --n_init_data 10 --exp_len 200 --randseed 0
```

### Bayesian Fine-Tuning

Run BO with Bayesian fine-tuning of features:

```sh
python run_bayesian_finetuning.py --problem kinase --method fsplaplace --foundation_model t5-base-chem --kernel linear --acqf ts --n_init_data 10 --exp_len 200 --randseed 0
```

## Model and Experiment Configuration

Each script accepts command-line arguments for configuring the BO pipeline. The main options and their supported values are:

- **Problems:** `redox-mer`, `solvation`, `kinase`, `laser`, `pce`, `photoswitch`, `ampc`, `d4`
- **Methods:** `random`, `gp` (model used in Sections 4.2 to 4.4), `gp_default` (GP used in Section 4.1, blue line in Fig. 2), `laplace`, `fsplaplace`
- **Feature Types:** `fingerprints`, `molformer`, `t5-base-chem`, `mordred`, `degree_of_conjugation`, `force_field`, `dft`, `all_features`, `hand_crafted_expert`, `hand_crafted_general`, `data_driven`
- **Kernels:** `tanimoto`, `matern_0.5`, `matern_1.5`, `matern_2.5`, `rbf`, `linear`
- **Acquisition Functions:** `ei`, `ucb`, `ts`
- **Foundation Models (for fine-tuning):** `molformer`, `roberta-large`, `t5-base`, `t5-base-chem`, `gpt2-medium`, `gpt2-large`, `llama-2-7b`
- **Optimization Parameters:** `lr`, `wd`, `noise_var`, `randseed`, `exp_len`

For a full list of options, run:

```sh
python run_fixed_features.py --help
```

