# Prepare Dataset

## Download

The SiN and HfO datasets can be downloaded from [this link](https://drive.google.com/drive/folders/1NtORYgIn7WGJ7XmsRR1YeWaw4qSWS9xx?usp=sharing). The folder contains the four tarballs described below.

### SiN.tar
* Contains the train, test, and validation splits utilized in this study.
* Also includes the out-of-distribution (OOD) set.
* All datasets are formatted in extended XYZ file format (.xyz).

### SiN_raw.tar
* Contains the raw dataset prior to sampling.
* DFT calculations for various scenarios are saved in extended XYZ format (.xyz).
* Additionally, the sampling script employed in this study is included.

### HfO.tar
* Contains the train, test, and validation splits utilized in this study.
* Also includes the out-of-distribution (OOD) set.
* All datasets are formatted in extended XYZ file format (.xyz).

### HfO_raw.tar
* Contains the raw dataset prior to sampling.
* DFT calculations for various scenarios are saved in extended XYZ format (.xyz).
* Additionally, the sampling script employed in this study is included.

Users can extract the tarballs used for the benchmark as follows.

```
# extract tar files at the datasets directory
cd ../../datasets
tar xf SiN.tar
tar xf HfO.tar

# optional
rm SiN.tar
rm HfO.tar
```
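After extraction, it can be useful to check how many structures each `.xyz` file holds. The sketch below is a minimal, standard-library-only reader that relies only on the basic XYZ frame layout: an atom-count line, a comment line (which carries the key=value metadata in extended XYZ), then one line per atom. The function name `frame_sizes` and the two-frame demo content are made up for illustration and are not part of this repository. For real work, a full parser such as ASE's `ase.io.read` is likely the better choice.

```python
import tempfile
from pathlib import Path


def frame_sizes(path):
    """Return the number of atoms in each frame of an (extended) XYZ file."""
    sizes = []
    with open(path) as f:
        while True:
            header = f.readline()
            if not header.strip():
                break  # end of file or trailing blank line
            n_atoms = int(header)
            f.readline()  # comment line (extended XYZ metadata), ignored here
            for _ in range(n_atoms):
                f.readline()  # one atom per line, ignored here
            sizes.append(n_atoms)
    return sizes


# Tiny two-frame example with made-up values, only to exercise the function.
example = (
    "2\n"
    'Lattice="5 0 0 0 5 0 0 0 5" Properties=species:S:1:pos:R:3 energy=-1.0\n'
    "Si 0.0 0.0 0.0\n"
    "N 1.0 1.0 1.0\n"
    "1\n"
    'Lattice="5 0 0 0 5 0 0 0 5" Properties=species:S:1:pos:R:3 energy=-0.5\n'
    "Si 0.0 0.0 0.0\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.xyz"
    demo_path.write_text(example)
    sizes = frame_sizes(demo_path)

print(len(sizes), "frames;", sum(sizes), "atoms in total")
```

To inspect a real file, pass the path of one of the extracted `.xyz` files to `frame_sizes` instead of the temporary demo file.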

## Sample and Split (Optional)

Users who use the prepared datasets (SiN.tar and HfO.tar) can skip this step.

We archive the preparation code that samples data from the raw datasets and splits them into train/validation/test sets.

The code can be used as follows.

```
# the split code should be located in the raw dataset directory.

# SiN
cp SiN_Split.py ../../datasets/SiN_raw
cd ../../datasets/SiN_raw
python SiN_Split.py
cd -  # return to the directory that holds the split scripts

# HfO
cp HfO_Split.py ../../datasets/HfO_raw
cd ../../datasets/HfO_raw
python HfO_Split.py
```

### Split Information (train/valid/test)

A dataset is divided into train, validation, and test sets with a ratio of 8:1:1.

The details are described in our benchmark paper.
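As an illustration of what an 8:1:1 division means in practice, the sketch below shuffles frame indices and cuts them into three lists. It is a plain random split written for this README, not the authors' `SiN_Split.py` or `HfO_Split.py`, whose sampling procedure is described in the paper. The function name `split_indices`, the fixed seed, and the choice to give any remainder to the test set are all assumptions.

```python
import random


def split_indices(n_items, ratios=(8, 1, 1), seed=0):
    """Shuffle range(n_items) and cut it into train/valid/test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    total = sum(ratios)
    n_train = n_items * ratios[0] // total
    n_valid = n_items * ratios[1] // total
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # any remainder goes to the test set
    return train, valid, test


train, valid, test = split_indices(1000)
print(len(train), len(valid), len(test))
```

The three lists are disjoint and together cover every index exactly once, so each frame ends up in exactly one split.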


## Convenient Preprocessing Script

To train models using this framework, users should prepare a database in .lmdb format.

[This script](./run_preprocess_data.sh) makes this easy; run it as follows.

```
./run_preprocess_data.py $DATA $OUTDATA_TYPE
```
You should specify `DATA` (the dataset name, e.g. `SiN` or `HfO`) and `OUTDATA_TYPE` (`cloud` or `graph`).


### ■ Option 1 (`cloud`)
Saves atomic coordinates.
```
./run_preprocess_data.py SiN cloud
./run_preprocess_data.py HfO cloud
```

### ■ Option 2 (`graph`)
Saves atomic coordinates and edges. The edges are generated with a cutoff radius of 6.0 and a maximum of 50 neighbor atoms, given as the third and fourth arguments.
```
./run_preprocess_data.py SiN graph 6.0 50
./run_preprocess_data.py HfO graph 6.0 50
```

Some models can generate graphs from atomic coordinates on the fly, but others (such as NequIP, Allegro, and MACE) cannot.

For the latter models, users should generate graphs in advance and save them to an .lmdb file.

If a cutoff radius is given, atom cloud data is converted into graph data and saved with edge indices.
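To make the cloud-to-graph conversion concrete, the sketch below builds edge indices from coordinates using a cutoff radius and a cap on the number of neighbors. It is not the framework's implementation. It is a brute-force O(N²) search that ignores periodic boundary conditions, which real bulk structures with a lattice would require. The function name `build_edges`, the `(center, neighbor)` edge ordering, and the rule of keeping the nearest neighbors when the cap is exceeded are all assumptions made for illustration.

```python
import math


def build_edges(positions, cutoff, max_neighbors):
    """Return (center, neighbor) index pairs for a non-periodic atom cloud.

    For each center atom, atoms within `cutoff` are sorted by distance
    and only the closest `max_neighbors` are kept.
    """
    edges = []
    for i, pos_i in enumerate(positions):
        candidates = []
        for j, pos_j in enumerate(positions):
            if i == j:
                continue  # no self-loops
            distance = math.dist(pos_i, pos_j)
            if distance <= cutoff:
                candidates.append((distance, j))
        candidates.sort()  # nearest neighbors first
        for _, j in candidates[:max_neighbors]:
            edges.append((i, j))
    return edges


# Four made-up atoms; the last one lies beyond the cutoff of all others.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (10.0, 0.0, 0.0)]
edges = build_edges(positions, cutoff=6.0, max_neighbors=50)
print(len(edges), "edges")
```

Lowering `max_neighbors` trims each atom's neighbor list to its nearest atoms, which bounds the number of edges per atom regardless of how dense the structure is.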
