
# RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design


This repository contains the official source code for our paper:

**Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design**


This repository provides a novel approach to designing RNA sequences from their tertiary structures. First, we provide a curated RNA tertiary structure dataset, collected and cleaned from the RNAsolo dataset. Second, we propose an RNA sequence design approach based on hierarchical representation learning. Third, we conduct extensive experiments on various benchmarks and external datasets.

The complete datasets will be released after the review period.

## Overview


<details open>
<summary>Code Structure</summary>

- `API/` contains the featurizer, recorder and dataset configurations.
- `methods/` contains our proposed method.
- `model/` contains the detailed network architectures of our proposed model.
- `data/` contains the complete dataset (separated into `train_data.pt`, `valid_data.pt`, `test_data.pt`, `Rfam.pt`, and `RNA_PUZZLE.pt`).
- `utils.py` contains project utilities, including checkpoint saving and log recording, as well as the TM-score and RMSD calculators used in the project.
- `main.py` contains the main function for running the project.
- `parser.py` contains the global parameters of the experiment.
- `requirements.txt` lists the packages required by the runtime environment.

</details>

## Installation

We provide a `requirements.txt` file listing the required dependencies. Users can replicate the environment using the following commands:

```shell
cd rdesign
# The Python version is not specified here; pin one if the requirements need it.
conda create -n Rdesign python
conda activate Rdesign
pip install -r requirements.txt
```

## Getting Started

**Accessing the Dataset**

The processed datasets will be released after the review period. The dataset should be organized as follows:

```
RDesign
├── API
├── assets
├── checkpoints
├── methods
├── model
└── data
    ├── RNAsolo
    │   ├── train_data.pt
    │   ├── val_data.pt
    │   ├── test_data.pt
```
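Before training, it can help to confirm that the files are where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_files` and the default root `data/RNAsolo` are assumptions based on the directory tree shown above.

```python
from pathlib import Path

# File names are taken from the layout above; the root path is an assumption.
EXPECTED_FILES = ("train_data.pt", "val_data.pt", "test_data.pt")


def missing_dataset_files(root="data/RNAsolo", expected=EXPECTED_FILES):
    """Return the expected dataset files that are not present under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("All dataset files found.")
```

Run it from the repository root so that the relative path resolves against the `RDesign` directory.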

**Model Training**

Execute the following command to run both training and testing:

```shell
python main.py --epoch 200 --batch_size 64 --seed 111 --lr 0.001 --hidden 128 --weigth_clu_con 0.5 --weigth_sam_con 0.5 --ss_temp 0.5
```

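The hyperparameters `weight_clu_con` and `weight_sam_con` are the weights of the representation losses $L_{cluster}$ and $L_{sample}$, respectively. Note that the command above spells the flags `--weigth_clu_con` and `--weigth_sam_con`; check `parser.py` for the exact names accepted by the argument parser.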
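To illustrate how two such weights typically enter a training objective, here is a minimal sketch that assumes the total loss is a weighted sum of a main task loss and the two representation losses. The function name `combined_loss`, the `task_loss` term, and the additive form are assumptions for illustration; the exact objective is defined in `methods/`.

```python
def combined_loss(task_loss, cluster_loss, sample_loss,
                  weight_clu_con=0.5, weight_sam_con=0.5):
    """Weighted sum of a task loss and two representation losses.

    Assumed form (not confirmed by the repository):
        L = L_task + weight_clu_con * L_cluster + weight_sam_con * L_sample
    Works with plain floats or with tensors that support + and *.
    """
    return task_loss + weight_clu_con * cluster_loss + weight_sam_con * sample_loss


# With both weights at 0.5, each representation loss contributes half its value.
total = combined_loss(task_loss=1.0, cluster_loss=2.0, sample_loss=4.0)
print(total)
```

Setting either weight to 0 removes the corresponding term under this assumed form, which is a common way to ablate a loss component.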

After training, the checkpoint, log file, and hyperparameters will be stored in `./checkpoints/`, organized as follows:

```
RDesign
├── checkpoints
  ├── checkpoint.pth
  ├── log.log
  ├── model_param.json
```

**Load the Model**

To load the trained model from the saved checkpoint and evaluate it, run the following code:

```python
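import argparse
import json

import torch

from main import Exp

# Rebuild the experiment from the hyperparameters saved during training.
with open('./checkpoints/model_param.json', 'r') as f:
    config = json.load(f)
args = argparse.Namespace(**config)
exp = Exp(args)

# load_state_dict expects a state dict, not a path, so load the file first.
state_dict = torch.load('./checkpoints/checkpoint.pth', map_location='cpu')
exp.method.model.load_state_dict(state_dict)
exp.test()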
```
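The loading code relies on the saved JSON file mapping directly onto an `argparse.Namespace`. The sketch below shows that round trip in isolation, using only the standard library. The helper `load_args` and the example values are hypothetical: they mirror the training command above, not the actual contents of `model_param.json`.

```python
import argparse
import json
import tempfile
from pathlib import Path


def load_args(path):
    """Read a JSON file of hyperparameters into an argparse.Namespace."""
    with open(path, "r") as f:
        return argparse.Namespace(**json.load(f))


# Hypothetical values mirroring the training command; not the real saved file.
example = {"epoch": 200, "batch_size": 64, "seed": 111, "lr": 0.001, "hidden": 128}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_param.json"
    path.write_text(json.dumps(example))
    args = load_args(path)

# Each JSON key becomes an attribute, just as if it had been parsed from the CLI.
print(args.batch_size, args.lr)
```

Because the namespace is built from the saved file, the restored experiment uses the same hyperparameters as the training run without repeating them on the command line.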


## Supported Models, Datasets, and Evaluation Metrics


  <details open>
    <summary>Currently supported datasets</summary>

  To download the processed datasets, please click [here]()

  - [X] Our proposed dataset
  - [X] RFAM
  - [X] RNA_PUZZLE

  </details>

  <details open>
    <summary>Currently supported evaluation metrics</summary>

  - [X] Recovery
  - [X] Macro F1-score
  </details>
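For readers unfamiliar with these metrics, the sketch below computes them for a single pair of sequences. It assumes that recovery is the fraction of positions where the predicted nucleotide matches the native one, and that Macro F1 is the unweighted mean of per-nucleotide F1 scores over A, U, C, and G. The function names are illustrative, and the evaluation code in this repository may differ in details such as averaging across sequences or the handling of absent classes.

```python
def recovery(pred, target):
    """Fraction of positions where the predicted nucleotide equals the target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not target:
        return 0.0
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def macro_f1(pred, target, labels="AUCG"):
    """Unweighted mean of per-nucleotide F1 scores.

    A class with no true or predicted occurrences scores 0.0 here,
    which is one common convention (an assumption, not the paper's rule).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    scores = []
    for label in labels:
        tp = sum(p == label and t == label for p, t in zip(pred, target))
        fp = sum(p == label and t != label for p, t in zip(pred, target))
        fn = sum(p != label and t == label for p, t in zip(pred, target))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A design that predicts only "A" recovers one of four positions.
print(recovery("AAAA", "AUCG"))
print(macro_f1("AUCG", "AUCG"))
```

Macro F1 complements recovery because it penalizes designs that score well only by over-predicting the most frequent nucleotide.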


## Citation

```
TBD
```