# Usage

## Docker build

```bash
docker build -t clasp:v0.1 docker/
```

Note: The SciPy installation or version may cause issues. As of 2024/04/01, reinstalling SciPy resolved the problem.

## CLaSP Training Data Preparation

### Download data from COD

#### Download metadata

```bash
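# Start an interactive shell in the container, with the repository mounted at /workspace
docker run --rm -it -v $PWD:/workspace clasp:v0.1 bash
# The remaining commands run inside the container
cd preprocess
# Example invocation
python download_cod_metadata.py cod_metadata_20240523.csv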
```
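Before moving on, it can help to confirm that the metadata file was written and is not empty. The following is a minimal sketch, assuming the script produces a regular CSV with a header row. `summarize_csv` is a hypothetical helper that is not part of this repository, and it makes no assumption about column names.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (header, row_count) for a CSV file that starts with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows at all
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


# Example usage (hypothetical), run from the preprocess directory:
#   header, rows = summarize_csv("cod_metadata_20240523.csv")
#   print(len(header), "columns,", rows, "data rows")
```

A row count of zero, or an unexpectedly short header, suggests the download should be repeated before continuing.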

#### Download crystal structures (CIF)

Reference: https://wiki.crystallography.net/howtoobtaincod/

```bash
mkdir -p COD; rsync -av --delete rsync://www.crystallography.net/cif/ COD/
```
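The mirror is large, so a quick count of the downloaded files is a cheap way to check that `rsync` completed. This is a minimal sketch, assuming the CIF files keep their `.cif` extension somewhere below `COD/`. `count_cif_files` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def count_cif_files(root):
    """Count regular files ending in .cif anywhere below root."""
    return sum(1 for p in Path(root).rglob("*.cif") if p.is_file())


# Example usage (hypothetical), run from the directory that contains COD/:
#   print(count_cif_files("COD"))
```

Running the count before and after a repeated `rsync` gives a rough idea of how many entries changed upstream.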

## Run Pre-training

Edit `configs/pretraining.yaml` to specify the path of the dataset.

```bash
docker run --rm --gpus 8 -it --shm-size=500g -v $PWD:/workspace -v /mnt/data/cod:/cod:ro -v /mnt/data/hf_cache:/root/.cache/huggingface/ clasp:v0.1 python train_OnMemory.py

# To override config parameters, append key=value pairs to the training command
python train_OnMemory.py batch_size=256 n_epochs=100
```
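The `key=value` arguments resemble Hydra-style overrides, though whether `train_OnMemory.py` uses Hydra is an assumption here. The sketch below only illustrates the general idea of layering command-line values over defaults loaded from YAML. `apply_overrides` is a hypothetical function, and the default values are placeholders rather than the contents of `configs/pretraining.yaml`.

```python
import ast


def apply_overrides(config, overrides):
    """Return a copy of config with "key=value" strings applied on top."""
    merged = dict(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        try:
            # Interpret numbers, booleans, lists, etc. as Python literals
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            value = raw  # anything else (e.g. a path) stays a plain string
        merged[key] = value
    return merged


# Placeholder defaults; the real values live in configs/pretraining.yaml
defaults = {"batch_size": 64, "n_epochs": 10}
print(apply_overrides(defaults, ["batch_size=256", "n_epochs=100"]))
# prints {'batch_size': 256, 'n_epochs': 100}
```

Values not named on the command line keep whatever the YAML file specifies, so the file remains the single place to record the defaults for a run.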

## Fine-tuning with keyword caption

### Generate keyword captions

Edit `configs/finetuning.yaml` to specify the path of the dataset.

```bash
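# Start an interactive shell in the vLLM image; the remaining commands run inside the container
docker run --rm --gpus 8 -it --shm-size=500g -v $PWD:/workspace -v /mnt/data/cod:/cod:ro -v /mnt/data/hf_cache:/root/.cache/huggingface/ --entrypoint '' vllm/vllm-openai:v0.5.4 bash
export HF_HOME=/root/.cache/huggingface

cd preprocess
# Retrieve abstracts, then generate keyword captions from them with vLLM
python retrieve_abstract.py
python generate_keywords_from_abst_vllm.py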
```

### Run Fine-tuning

```bash
docker run --rm --gpus 8 -it --shm-size=500g -v $PWD:/workspace -v /mnt/data/cod:/cod:ro -v /mnt/data/hf_cache:/root/.cache/huggingface/ clasp:v0.1 python train_finetuning.py
```
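The `docker run` invocations in this README share the same GPU, shared-memory, and volume options, and differ only in the image and the command. If you launch runs from Python, a small builder keeps those options in one place. This is a sketch under the assumption that the host paths match the ones shown in this README. `clasp_docker_cmd` is a hypothetical helper, not part of this repository.

```python
import os
import shlex


def clasp_docker_cmd(command, image="clasp:v0.1", gpus="8", shm_size="500g",
                     cod_dir="/mnt/data/cod", hf_cache="/mnt/data/hf_cache"):
    """Build the `docker run` argument list with the mounts used in this README."""
    return [
        "docker", "run", "--rm", "--gpus", gpus, "-it",
        f"--shm-size={shm_size}",
        "-v", f"{os.getcwd()}:/workspace",               # repository root
        "-v", f"{cod_dir}:/cod:ro",                      # COD data, read-only
        "-v", f"{hf_cache}:/root/.cache/huggingface/",   # Hugging Face cache
        image, *command,
    ]


cmd = clasp_docker_cmd(["python", "train_finetuning.py"])
# Print a shell-quoted command line; alternatively pass cmd to subprocess.run(cmd, check=True)
print(shlex.join(cmd))
```

Adjust `cod_dir` and `hf_cache` if your data and cache live elsewhere on the host.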

## Model evaluation

### ROC evaluation

```bash
# Start an interactive shell in the container; the remaining commands run inside it
docker run --rm --gpus 8 -it --shm-size=500g -v $PWD:/workspace -v /mnt/data/cod:/cod:ro -v /mnt/data/hf_cache:/root/.cache/huggingface/ clasp:v0.1 bash
cd eval_scripts
python eval_zero_shot_roc.py
```
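For orientation when reading the results: ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below is a generic pure-Python illustration of that definition, not the logic of `eval_zero_shot_roc.py`. The function name, scores, and labels are made up for the example.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up similarity scores and binary labels
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.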

### Visualizing embeddings

Refer to `notebooks/visualize_embeddings.ipynb`.
