# DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome



## Contents

- [1. Introduction](#1-introduction)

- [2. Model and Data](#2-model-and-data)

- [3. Setup Environment](#3-setup-environment)

- [4. Quick Start](#4-quick-start)

- [5. Finetune](#5-finetune)

  



## 1. Introduction

DNABERT-2 is a foundation model trained on large-scale multi-species genomes that achieves state-of-the-art performance on the $28$ datasets of the GUE benchmark. It replaces k-mer tokenization with Byte Pair Encoding (BPE) and positional embeddings with Attention with Linear Biases (ALiBi), and incorporates other techniques to improve the efficiency and effectiveness of DNABERT.
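
To give a feel for what ALiBi does, here is a minimal sketch of the general idea in plain Python: instead of adding positional embeddings to the input, each attention head adds a distance-proportional penalty to its attention scores. The function names are hypothetical, the slope schedule is the one from the ALiBi paper for power-of-two head counts, and the symmetric `|i - j|` form is an assumption for an encoder-style model. This is an illustration only and is not taken from the DNABERT-2 code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes for a power-of-two head count (geometric sequence)."""
    if num_heads <= 0 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Bias of shape [num_heads][seq_len][seq_len], added to attention scores.

    Assumption: symmetric distance |i - j|, as one would use in a
    bidirectional encoder. Farther tokens receive a larger penalty.
    """
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


bias = alibi_bias(num_heads=8, seq_len=4)
print(alibi_slopes(8))   # slopes halve from one head to the next
print(bias[0][0])        # penalties seen by the first token in head 0
```

Because the bias depends only on the distance between tokens, it is defined for any sequence length, including lengths not seen during training.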



## 2. Model and Data

Please download the model and data from the following links.



Model: https://anonymfile.com/47ar/model.zip

Data: https://anonymfile.com/m4LB/gue.zip



### 2.1 GUE: Genome Understanding Evaluation

GUE is a comprehensive benchmark for genome understanding consisting of $28$ distinct datasets across $7$ tasks and $4$ species. GUE can be downloaded as a zip file from the link above. Statistics and model performances on GUE are shown as follows:





## 3. Setup Environment

    # create and activate a virtual Python environment
    conda create -n dna python=3.8
    conda activate dna

    # install required packages
    python3 -m pip install -r requirements.txt





## 4. Quick Start

Our model is easy to use with the [transformers](https://github.com/huggingface/transformers) package.

```
# unzip the model and data downloaded in Section 2
unzip model.zip
unzip gue.zip
```






To load the model:
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_dir = "/ABSOLUTE/PATH/TO/MODEL/FOLDER"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True)
```


To calculate the embedding of a DNA sequence:
```python
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling over the token dimension
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expected: torch.Size([768])

# embedding with max pooling over the token dimension
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expected: torch.Size([768])
```
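
When embedding several sequences of different lengths in one batch, a plain mean over the token dimension would also average the padding positions. The sketch below shows one common way to avoid that, using a hypothetical helper `masked_mean_pool` and synthetic tensors in place of real model output so that it runs without the model files. It is an illustration under those assumptions, not part of the released code.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask):
    """Average token embeddings over real (non-padding) positions only.

    hidden_states:  [batch, seq_len, hidden]
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


# Synthetic stand-in for model output: 2 sequences, 3 positions, hidden size 2.
# The last position of the first sequence plays the role of padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # [batch, hidden]
```

With the real model, the mask would typically come from calling the tokenizer on a list of sequences with `padding=True` and reading the `attention_mask` entry, which is standard `transformers` tokenizer behavior. Whether the DNABERT-2 model code expects the mask as an explicit argument is not stated here, so check its forward signature before relying on this.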





## 5. Finetune

### 5.1 Evaluate models on GUE



The current scripts are set to use `DataParallel` for training on 4 GPUs. If you have a different number of GPUs, change `per_device_train_batch_size` and `gradient_accumulation_steps` so that the global batch size stays at 32, which is required to replicate the results in the paper. To perform distributed multi-GPU training (e.g., with `DistributedDataParallel`), change `python` to `torchrun --nproc_per_node ${n_gpu}`.
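
The batch-size arithmetic can be sketched as follows. This assumes the usual relation, global batch size = number of GPUs × `per_device_train_batch_size` × `gradient_accumulation_steps`; the helper name and the per-device value of 4 are hypothetical and only serve as an example, so check the values actually set in the scripts.

```python
def accumulation_steps(n_gpu, per_device_train_batch_size, global_batch_size=32):
    """Return the gradient_accumulation_steps that yields the target global batch size."""
    per_step = n_gpu * per_device_train_batch_size  # samples processed per forward/backward pass
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError(
            f"cannot reach a global batch size of {global_batch_size} "
            f"with {n_gpu} GPU(s) x {per_device_train_batch_size} per device"
        )
    return global_batch_size // per_step


# Example: keep the global batch size at 32 for different GPU counts,
# assuming a per-device batch size of 4 (hypothetical value).
for n_gpu in (1, 2, 4, 8):
    steps = accumulation_steps(n_gpu, per_device_train_batch_size=4)
    print(f"{n_gpu} GPU(s): gradient_accumulation_steps = {steps}")
```

If the combination does not divide 32 evenly, adjust `per_device_train_batch_size` rather than accepting a different global batch size.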


```
export DATA_PATH=/path/to/GUE  # (e.g., /home/user)
export MODEL_PATH=/path/to/model  # (e.g., /home/user/DNABERT_2)
cd finetune

# Evaluate DNABERT-2 on GUE
sh scripts/run_dnabert2.sh $DATA_PATH $MODEL_PATH

# Evaluate DNABERT (e.g., DNABERT with 3-mer) on GUE
# 3 for 3-mer, 4 for 4-mer, 5 for 5-mer, 6 for 6-mer
sh scripts/run_dnabert1.sh $DATA_PATH 3

# Evaluate Nucleotide Transformers on GUE
# 0 for 500m-1000g, 1 for 500m-human-ref, 2 for 2.5b-1000g, 3 for 2.5b-multi-species
sh scripts/run_nt.sh $DATA_PATH 0
```

