# Usage

This repository is built upon the ClaSP framework.

## Docker build

```bash
docker build -t clasp:v0.1 docker/
```

Note: The SciPy installation or version may cause issues. Reinstalling SciPy resolved the problem when it occurred (2024/04/01).

## Training Data Preparation

### Download data from COD

#### Download metadata

```bash
docker run --rm -it -v $PWD:/workspace clasp:v0.1 bash
cd preprocess
# example
python download_cod_metadata.py cod_metadata_20240909.csv
```
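To confirm that the metadata download produced a usable file, a quick look at the header and row count can help. The following is a minimal sketch: `csv_summary` is a hypothetical helper that is not part of this repository, and it assumes the CSV has a single header row.

```bash
# Hypothetical helper (not part of the repository): print the header line and
# the number of data rows of a CSV file, assuming one header row.
# Quoted fields that contain newlines would be over-counted.
csv_summary() {
  echo "header: $(head -n 1 "$1")"
  echo "rows:   $(tail -n +2 "$1" | wc -l | tr -d ' ')"
}

# Example, assuming the script wrote the CSV to the current directory:
# csv_summary cod_metadata_20240909.csv
```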

#### Download crystal structures (CIF)

Reference: https://wiki.crystallography.net/howtoobtaincod/

```bash
mkdir -p COD; rsync -av --delete rsync://www.crystallography.net/cif/ COD/
```
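Once `rsync` finishes, it may be worth checking that the mirror actually contains CIF files before moving on. The sketch below counts them; `count_cif_files` is a hypothetical helper that is not part of this repository, and it assumes the structures are stored with a `.cif` extension under `COD/`.

```bash
# Hypothetical helper (not part of the repository): count the .cif files
# below a directory. Defaults to the COD/ directory created above.
count_cif_files() {
  find "${1:-COD}" -type f -name '*.cif' | wc -l | tr -d ' '
}

# Report the count only if the mirror directory exists.
if [ -d COD ]; then
  echo "CIF files in COD/: $(count_cif_files COD)"
fi
```

A count of zero, or one far below what you expect, suggests that the transfer was interrupted. Re-running the same `rsync` command resumes it.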

#### Generate keyword captions

Edit `configs/finetuning.yaml` to specify the path of the dataset.

```bash
docker run --rm --gpus 4 -it --shm-size=500g -v $PWD:/workspace -v /raid/COD:/cod:ro -v /raid/clasp_data:/data -v /raid/hf_cache:/root/.cache/huggingface/ --entrypoint '' vllm/vllm-openai:v0.5.4 bash
export HF_HOME=/root/.cache/huggingface

cd preprocess
python retrieve_abstract.py
python generate_keywords_from_abst_vllm.py
```

## Run training

Edit `configs/pretraining.yaml` to specify the path of the dataset.

```bash
docker run --rm --gpus 4 -it --shm-size=200g -v $PWD:/workspace -v /raid/COD:/cod:ro -v /raid/clasp_data:/data -v /raid/hf_cache:/root/.cache/huggingface/ clasp:v0.16 python train_OnMemory.py

# Override config parameters by passing key=value pairs on the command line
python train.py batch_size=1024 n_epochs=2000 attention=cross
```
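The long `docker run` invocation is repeated for training and evaluation, so a small wrapper can cut down on copy-paste errors. This is only a sketch: `clasp_docker` and the `COD_DIR`, `DATA_DIR`, `HF_CACHE`, and `DRY_RUN` variables are hypothetical names that are not part of this repository. The defaults mirror the host paths and options used in the command above, and running `train.py` through the wrapper assumes that it is meant to be executed inside the container.

```bash
# Hypothetical helper (not part of the repository): wraps the docker run
# options used in this README. Host paths default to the ones shown above
# and can be overridden through environment variables.
clasp_docker() {
  image="$1"; shift
  set -- docker run --rm --gpus 4 -it --shm-size=200g \
    -v "$PWD:/workspace" \
    -v "${COD_DIR:-/raid/COD}:/cod:ro" \
    -v "${DATA_DIR:-/raid/clasp_data}:/data" \
    -v "${HF_CACHE:-/raid/hf_cache}:/root/.cache/huggingface/" \
    "$image" "$@"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"   # print the command instead of running it
  else
    "$@"
  fi
}

# Dry run: show the full command for a training job with overrides.
DRY_RUN=1 clasp_docker clasp:v0.16 python train.py batch_size=1024 n_epochs=2000 attention=cross
```

Dropping `DRY_RUN=1` executes the command. Setting `COD_DIR`, `DATA_DIR`, or `HF_CACHE` points the mounts at different host directories without editing the function.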

## ROC evaluation

```bash
docker run --rm --gpus 4 -it --shm-size=200g -v $PWD:/workspace -v /raid/COD:/cod:ro -v /raid/clasp_data:/data -v /raid/hf_cache:/root/.cache/huggingface/ clasp:v0.16 bash
cd eval_scripts
python eval_zero_shot_roc.py
```