# A Framework of SO(3)-equivariant Non-linear Representation Learning and its Application to Electronic-Structure Hamiltonian Prediction

## Overview
This README describes how to train and test the models from our work **'A Framework of SO(3)-equivariant Non-linear Representation Learning and its Application to Electronic-Structure Hamiltonian Prediction'**. The accompanying software package is called **'TraceGrad'**, an abbreviation of the core of our method. Please read this file in a viewer or editor that renders Markdown.
In this document, we explain how the TraceGrad module is used in two different baseline schemes, each validated on a representative benchmark series. The sections below detail the configurations and steps for running both schemes.

## !!NOTE
- Because the Code Ocean platform limits data upload volume, only a limited set of databases, such as Bilayer Graphene, is uploaded directly to the platform. Code Ocean also restricts compute time and GPU memory. Therefore, **we strongly recommend that reviewers interested in replication experiments download the complete datasets to their own servers, using the links provided below.**
- Because the benchmark series require different experimental environments, the Code Ocean setup is configured exclusively for the databases of the DeepH benchmark series. Reviewers interested in reproducing experiments from the QH9 benchmark series are encouraged to follow the environment setup guide provided below to set up a dedicated environment on their own servers.

## Experiments on the DeepH Benchmark Series

### Introduction
- **Framework**: DeepH-E3+TraceGrad. The DeepH-E3 method [2] serves as the baseline model here.

- **Code Path**: your_path/to/TraceGrad_DeepH_Benchmark_Series

### System Requirements
The experiments for 'DeepH-E3+TraceGrad' are conducted on a Tesla A6000 GPU cluster. Each GPU card in the cluster has 48 GiB of memory. The software environment required is outlined below:

- Python Version: 3.11.8
- torch Version: 2.0.1+cu117
- tqdm Version: 4.66.1
- pymatgen Version: 2023.8.10
- torch_geometric Version: 2.3.1
- pathos Version: 0.3.1
- h5py Version: 3.9.0
- e3nn Version: 0.4.4
- tensorboard Version: 2.15.1

### Installation
- `conda create -n DeepHE3_tracegrad python=3.11.8`
- `conda activate DeepHE3_tracegrad`
- `pip install torch==2.0.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html`
- `pip install -r requirements.txt`
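
After installation, it can help to confirm that the environment matches the pinned versions listed under System Requirements. The following is a minimal sketch, not part of the TraceGrad package: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes the distribution names registered by pip match the names in the list above.

```python
from importlib import metadata

# Pinned versions copied from the System Requirements list above.
# Assumption: pip registers each package under exactly this name.
REQUIRED = {
    "torch": "2.0.1+cu117",
    "tqdm": "4.66.1",
    "pymatgen": "2023.8.10",
    "torch_geometric": "2.3.1",
    "pathos": "0.3.1",
    "h5py": "3.9.0",
    "e3nn": "0.4.4",
    "tensorboard": "2.15.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in required.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that is missing or has a different version.
    for name, (expected, found) in find_mismatches(REQUIRED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty output means every listed package is installed at the pinned version; a `found None` entry means the package is not installed in the active environment.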

### Dataset Preparation

Experiments are conducted on six crystalline material databases released by the DeepH series [1,2]. These databases include:

1. Monolayer Graphene (MG)
2. Monolayer MoS2 (MM)
3. Bilayer Graphene (BG)
4. Bilayer Bismuthene (BB)
5. Bilayer Bi2Te3 (BT)
6. Bilayer Bi2Se3 (BS)

Create a `data` folder in your working directory to store these datasets. The download links and instructions for each dataset are as follows:

- MG Dataset: [Download here](https://zenodo.org/records/7553640/files/Monolayer_graphene_dataset.zip?download=1). Unzip as `/your_path/data/Monolayer_graphene_dataset`.
- MM Dataset: [Download here](https://zenodo.org/records/7553640/files/Monolayer_MoS2_dataset.zip?download=1). Unzip as `/your_path/data/Monolayer_MoS2_dataset`.
- BG Dataset: 
  - Non-twisted samples: [Download here](https://zenodo.org/records/7553640/files/Bilayer_graphene_dataset.zip?download=1). Unzip as `/your_path/data/Bilayer_graphene_dataset`.
  - Twisted subset: [Download here](https://zenodo.org/records/7553640/files/Bilayer_graphene_twisted.zip?download=1). Unzip as `/your_path/data/Bilayer_graphene_twisted`.
- BB Dataset: 
  - Non-twisted samples: [Download here](https://zenodo.org/records/7553640/files/Bismuth_dataset.zip?download=1). Unzip as `/your_path/data/Bilayer_Bismuth_dataset`.
  - Twisted subset: [Download here](https://zenodo.org/records/7553640/files/Bismuth_twisted.zip?download=1). Unzip as `/your_path/data/Bilayer_Bismuth_twisted`.
- BT Dataset: 
  - Non-twisted samples: [Download here](https://zenodo.org/records/7553843/files/Bi2Te3_dataset_soc.zip?download=1). Unzip as `/your_path/data/Bilayer_Bi2Te3_dataset`.
  - Twisted subset: [Download here](https://zenodo.org/records/7553843/files/Bi2Te3_twisted_soc.zip?download=1). Unzip as `/your_path/data/Bilayer_Bi2Te3_twisted`.
- BS Dataset: 
  - Non-twisted samples: Download from four links and unzip all to `/your_path/data/Bilayer_Bi2Se3_dataset`.
    - [Dataset 1](https://zenodo.org/records/7553827/files/Bi2Se3_dataset1.zip?download=1)
    - [Dataset 2](https://zenodo.org/records/7553827/files/Bi2Se3_dataset2.zip?download=1)
    - [Dataset 3](https://zenodo.org/records/7553827/files/Bi2Se3_dataset3.zip?download=1)
    - [Dataset 4](https://zenodo.org/records/7553827/files/Bi2Se3_dataset4.zip?download=1)
  - Twisted subset: [Download here](https://zenodo.org/records/7553827/files/Bi2Se3_twisted.zip?download=1). Unzip as `/your_path/data/Bilayer_Bi2Se3_twisted`.
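
Once the archives are unzipped, a quick check that every expected folder is present can save a failed run later. This is a minimal sketch under the assumption that all datasets live directly under one `data` root; the function name `missing_datasets` is hypothetical, and the folder names are the unzip targets listed above.

```python
from pathlib import Path

# Folder names taken from the unzip targets listed above.
EXPECTED_DIRS = [
    "Monolayer_graphene_dataset",
    "Monolayer_MoS2_dataset",
    "Bilayer_graphene_dataset",
    "Bilayer_graphene_twisted",
    "Bilayer_Bismuth_dataset",
    "Bilayer_Bismuth_twisted",
    "Bilayer_Bi2Te3_dataset",
    "Bilayer_Bi2Te3_twisted",
    "Bilayer_Bi2Se3_dataset",
    "Bilayer_Bi2Se3_twisted",
]


def missing_datasets(data_root):
    """Return the expected dataset folders that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

For example, `missing_datasets("/your_path/data")` returns an empty list when all ten folders are in place.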


### Training and Testing Instructions:
These instructions provide a step-by-step guide for training and testing the 'DeepH-E3+TraceGrad' framework on each of the six datasets. Follow these steps for each dataset:

#### MG Dataset:

  i. **Train and Test**:

     - Update the paths in `/your_path/inis/Monolayer_graphene_train_test.ini` for `processed_data_dir`, `save_graph_dir`, and `save_dir`. Typically, replace "/your_path/" with your current directory, while keeping the subdirectories as default.

     - Execute the following command: `python3 TraceGradH-train-test.py /your_path/inis/Monolayer_graphene_train_test.ini`. This trains the network and tests its performance.

     - Note: To test an already trained network, set `checkpoint` in `Monolayer_graphene_train_test.ini` to the path of the trained `.pkl` file, set `is_training` to `False`, and run the same command.
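
The path update described above applies to every `.ini` file in this README and can be scripted. The sketch below is a plain text substitution of the `/your_path/` placeholder and makes no assumption about section or key names inside the file; the helper names `fill_placeholder` and `rewrite_ini` are hypothetical and not part of the package.

```python
from pathlib import Path


def fill_placeholder(ini_text, root, placeholder="/your_path/"):
    """Replace every occurrence of the placeholder prefix with the given root."""
    # Normalise the root so it ends with exactly one slash, like the placeholder.
    root = str(root).rstrip("/") + "/"
    return ini_text.replace(placeholder, root)


def rewrite_ini(ini_path, root):
    """Rewrite an .ini file in place with the placeholder substituted."""
    path = Path(ini_path)
    path.write_text(fill_placeholder(path.read_text(), root))
```

Calling `rewrite_ini("inis/Monolayer_graphene_train_test.ini", Path.cwd())` would point all placeholder paths at the current directory while keeping the default subdirectories. Review the file afterwards, since any path that does not start with the placeholder is left untouched.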

#### MM Dataset:

  i. **Train and Test**:

     - Update the paths in `/your_path/inis/Monolayer_MoS2_train_test.ini` for `processed_data_dir`, `save_graph_dir`, and `save_dir`. Typically, replace "/your_path/" with your current directory, while keeping the subdirectories as default.

     - Execute the following command: `python3 TraceGradH-train-test.py /your_path/inis/Monolayer_MoS2_train_test.ini`. This trains the network and tests its performance.

     - Note: To test an already trained network, set `checkpoint` in `Monolayer_MoS2_train_test.ini` to the path of the trained `.pkl` file, set `is_training` to `False`, and run the same command.

#### BG Dataset:

  i. **Train and Test**:

     - Update the paths in `/your_path/inis/Bilayer_graphene_train_test.ini` for `processed_data_dir`, `save_graph_dir`, and `save_dir`. Typically, replace "/your_path/" with your current directory, while keeping the subdirectories as default.

     - Execute the following command: `python3 TraceGradH-train-test.py /your_path/inis/Bilayer_graphene_train_test.ini`. This trains the network and tests its performance.

     - Note: To test an already trained network, set `checkpoint` in `Bilayer_graphene_train_test.ini` to the path of the trained `.pkl` file, set `is_training` to `False`, and run the same command.

  ii. **Testing on Twisted Samples**:

      - For testing the trained network on twisted samples, update the `trained_model_dir` in `/your_path/inis/Bilayer_graphene_eval_twist.ini` with the directory of the trained model (found in a subfolder of the `save_dir` from the training phase, containing `.pkl` files and automatically saved src files). Also, update `processed_data_dir`, `save_graph_dir`, and `save_dir`, replacing "/your_path/" with your current directory. Then, run `python3 TraceGradH-evaltwisted.py /your_path/inis/Bilayer_graphene_eval_twist.ini` to get the test results.
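
Locating the trained model directory for `trained_model_dir` can be automated. The sketch below assumes, as described above, that each training run creates a subfolder of `save_dir` that directly contains the `.pkl` files; the function name `find_model_dirs` is hypothetical.

```python
from pathlib import Path


def find_model_dirs(save_dir):
    """Return subfolders of save_dir that directly contain a .pkl file, newest first."""
    candidates = [
        p for p in Path(save_dir).iterdir()
        if p.is_dir() and any(p.glob("*.pkl"))
    ]
    # Sort by modification time so the most recent run comes first.
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)
```

The first entry of the returned list is the most recently modified run, which is usually the one to copy into `trained_model_dir`; check it manually if several runs share one `save_dir`.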

#### BB Dataset:

  i. **Train and Test**:

     - Update the paths in `/your_path/inis/Bilayer_Bismuth_train_test_soc.ini` for `processed_data_dir`, `save_graph_dir`, and `save_dir`. Typically, replace "/your_path/" with your current directory, while keeping the subdirectories as default.

     - Execute the following command: `python3 TraceGradH-train-test.py /your_path/inis/Bilayer_Bismuth_train_test_soc.ini`. This trains the network and tests its performance.

     - Note: To test an already trained network, set `checkpoint` in `Bilayer_Bismuth_train_test_soc.ini` to the path of the trained `.pkl` file, set `is_training` to `False`, and run the same command.

  ii. **Testing on Twisted Samples**:

      - For testing the trained network on twisted samples, update the `trained_model_dir` in `/your_path/inis/Bilayer_Bismuth_eval_soc_twist.ini` with the directory of the trained model (found in a subfolder of the `save_dir` from the training phase, containing `.pkl` files and automatically saved src files). Also, update `processed_data_dir`, `save_graph_dir`, and `save_dir`, replacing "/your_path/" with your current directory. Then, run `python3 TraceGradH-evaltwisted.py /your_path/inis/Bilayer_Bismuth_eval_soc_twist.ini` to get the test results.

#### BT Dataset:

  i. **Train and Test**:

     - Update the paths in `/your_path/inis/Bilayer_Bi2Te3_train_test_soc.ini` for `processed_data_dir`, `save_graph_dir`, and `save_dir`. Typically, replace "/your_path/" with your current directory, while keeping the subdirectories as default.

     - Execute the following command: `python3 TraceGradH-train-test.py /your_path/inis/Bilayer_Bi2Te3_train_test_soc.ini`. This trains the network and tests its performance.

     - Note: To test an already trained network, set `checkpoint` in `Bilayer_Bi2Te3_train_test_soc.ini` to the path of the trained `.pkl` file, set `is_training` to `False`, and run the same command.

  ii. **Testing on Twisted Samples**:

      - For testing the trained network on twisted samples, update the `trained_model_dir` in `/your_path/inis/Bilayer_Bi2Te3_eval_soc_twist.ini` with the directory of the trained model (found in a subfolder of the `save_dir` from the training phase, containing `.pkl` files and automatically saved src files). Also, update `processed_data_dir`, `save_graph_dir`, and `save_dir`, replacing "/your_path/" with your current directory. Then, run `python3 TraceGradH-evaltwisted.py /your_path/inis/Bilayer_Bi2Te3_eval_soc_twist.ini` to get the test results.

#### BS Dataset:

  i. **Train and Test**:

     - Update the paths in `/your_path/inis/Bilayer_Bi2Se3_train_test_soc.ini` for `processed_data_dir`, `save_graph_dir`, and `save_dir`. Typically, replace "/your_path/" with your current directory, while keeping the subdirectories as default.

     - Execute the following command: `python3 TraceGradH-train-test.py /your_path/inis/Bilayer_Bi2Se3_train_test_soc.ini`. This trains the network and tests its performance.

     - Note: To test an already trained network, set `checkpoint` in `Bilayer_Bi2Se3_train_test_soc.ini` to the path of the trained `.pkl` file, set `is_training` to `False`, and run the same command.

  ii. **Testing on Twisted Samples**:

      - For testing the trained network on twisted samples, update the `trained_model_dir` in `/your_path/inis/Bilayer_Bi2Se3_eval_soc_twsit.ini` with the directory of the trained model (found in a subfolder of the `save_dir` from the training phase, containing `.pkl` files and automatically saved src files). Also, update `processed_data_dir`, `save_graph_dir`, and `save_dir`, replacing "/your_path/" with your current directory. Then, run `python3 TraceGradH-evaltwisted.py /your_path/inis/Bilayer_Bi2Se3_eval_soc_twsit.ini` to get the test results.
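
Because the train-and-test step follows the same pattern for all six datasets, the runs can be queued from one script. This is a minimal sketch, assuming it is launched from the code directory containing `TraceGradH-train-test.py` and that the paths in every `.ini` file have already been updated; `build_commands` and `run_all` are hypothetical helpers. The twisted-sample evaluations are left out on purpose, since their `trained_model_dir` is only known after training.

```python
import subprocess
import sys

# Training configuration files named in the sections above.
TRAIN_INIS = [
    "Monolayer_graphene_train_test.ini",
    "Monolayer_MoS2_train_test.ini",
    "Bilayer_graphene_train_test.ini",
    "Bilayer_Bismuth_train_test_soc.ini",
    "Bilayer_Bi2Te3_train_test_soc.ini",
    "Bilayer_Bi2Se3_train_test_soc.ini",
]


def build_commands(ini_dir, python=sys.executable):
    """Build one train-and-test command per dataset configuration."""
    ini_dir = str(ini_dir).rstrip("/")
    return [
        [python, "TraceGradH-train-test.py", f"{ini_dir}/{name}"]
        for name in TRAIN_INIS
    ]


def run_all(ini_dir):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in build_commands(ini_dir):
        subprocess.run(cmd, check=True)
```

Running the datasets sequentially keeps GPU memory use to one job at a time; `check=True` makes a failed run raise immediately instead of silently continuing to the next dataset.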

## Experiments on the QH9 Benchmark Series

### Introduction
- **Framework**: QHNet+TraceGrad. The QHNet method [4] serves as the baseline model here.

- **Code Path**: your_path/to/TraceGrad_QH9_Benchmark_Series

### System Requirements
- Python Version: 3.8.19
- PyTorch: 1.11.0
- PyG: 2.1.0
- e3nn: 0.5.1
- pyscf: 2.2.1 (QH9-Stable)
- pyscf: 2.3.0 (QH9-Dynamic-100k)
- hydra-core: 1.3.2

### Installation
- `conda create -y -n QHNet_tracegrad python=3.8.19`
- `conda activate QHNet_tracegrad`
- `conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch`
- `pip install scipy`
- `conda install pyg==2.1.0 -c pyg`
- `pip install tqdm "hydra-core>=1.2.0" pyscf rdkit transformers torch_ema e3nn lmdb apsw gdown` (the quotes keep the shell from treating `>=` as an output redirection)

### Dataset Preparation

Experiments are conducted on the **QH9-stable** and **QH9-dynamic** databases released by [3]. The split strategies selected for them are **ood** (`size_ood` in the code) and **mol**, respectively. Please refer to the original paper for details.

### Dataset Download
Create a `datasets` folder in your working directory to store these datasets. You can download the `datasets` folder, which includes the raw data files `QH9Stable.db` and `QH9Dynamic.db`, via [this Google Drive link](https://drive.google.com/drive/folders/13pPgBh3XvN2FCpowfnA8TT4VJ0OTceNM?usp=sharing) or [this OneDrive link](https://tamucs-my.sharepoint.com/:f:/g/personal/haiyang_tamu_edu/Ev4XIVcumhVFtaI8lUkIHXABHkKnKgWSJ5LYZOo67UKO0g?e=tsXkT1). Zip files of the datasets are also available in [this Google Drive folder](https://drive.google.com/drive/u/0/folders/1LXTC8uaOQzmb76FsuGfwSocAbK5Hshfj).

### Dataset Usage
We provide the datasets as commonly used PyG datasets. Here are simple examples to load our datasets with a few lines of code. 
```python
from torch_geometric.loader import DataLoader
from datasets import QH9Stable, QH9Dynamic

### Use ONE of the following lines to load the specific dataset (the second is commented out so it does not overwrite the first)
dataset = QH9Stable(split='size_ood')  # QH9-stable-ood
# dataset = QH9Dynamic(split='mol', version='100k')  # QH9-dynamic-mol

### Get the training/validation/testing subsets
train_dataset = dataset[dataset.train_mask]
valid_dataset = dataset[dataset.val_mask]
test_dataset = dataset[dataset.test_mask]

### Get the dataloaders
train_data_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_data_loader = DataLoader(valid_dataset, batch_size=32, shuffle=False)
test_data_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
```

### Training and Testing Instructions:
These instructions provide a step-by-step guide for training and testing the 'QHNet+TraceGrad' framework on each of the two datasets. Follow these steps for each dataset:

#### Training
```shell script
### Modify the configurations in config/config.yaml (or pass the configurations as args) as needed, and then run
python main.py datasets=QH9-stable datasets.split=size_ood # QH9-stable-ood
python main.py datasets=QH9-dynamic datasets.split=mol datasets.version=100k  # QH9-dynamic-100k-mol
```

#### Evaluating the trained model
```shell script
### Modify the configurations in config/config.yaml (or pass the configurations as args) as needed (including the trained_model arg), and then run
python test.py datasets=QH9-stable datasets.split=size_ood  trained_model='your_path/to/TraceGrad_QH9_Benchmark_Series/output/xxx/results_best.pt' # QH9-stable-ood 
python test.py datasets=QH9-dynamic datasets.split=mol datasets.version=100k trained_model='your_path/to/TraceGrad_QH9_Benchmark_Series/output/xxx/results_best.pt' # QH9-dynamic-100k-mol
```


## References
1. Li, H., Wang, Z., Zou, N., Ye, M., Xu, R., Gong, X., Duan, W., and Xu, Y. (2022). Deep-learning density functional theory Hamiltonian for efficient ab initio electronic-structure calculation. *Nature Computational Science*, 2(6), 367–377.
2. Gong, X., Li, H., Zou, N., Xu, R., Duan, W., and Xu, Y. (2023). General framework for E(3)-equivariant neural network representation of density functional theory Hamiltonian. *Nature Communications*, 14(1), 2848.
3. Yu, H., Liu, M., Luo, Y., Strasser, A., Qian, X., Qian, X., and Ji, S. (2024). QH9: A quantum Hamiltonian prediction benchmark for QM9 molecules. *Advances in Neural Information Processing Systems*, 36.
4. Yu, H., Xu, Z., Qian, X., Qian, X., and Ji, S. (2023, July). Efficient and equivariant graph networks for predicting quantum Hamiltonian. In *International Conference on Machine Learning* (pp. 40412–40424). PMLR.
