# GENATATORs

This document describes the code for GENATATORs: de novo gene annotation with DNA language models.

| Workflow | Folder | Key files |
|----------|--------|-----------|
| Fine-tune GENA-LM & Caduceus-PH/PS | `downstream_tasks/caduceus_gena/` | `.sh` scripts with the `_CADUCEUS` and `_UNET_segmented` substrings in their names (run fine-tuning) |
| Inference with Segment NT (human & multispecies) | `downstream_tasks/nt/` | `nt_pred.py` (run inference), `metrics.sh` (compute exon/gene-level metrics) |
| Train linear head on Evo 2 embeddings | `downstream_tasks/evo2/` | `run_evo2.sh` (extract Evo 2 embeddings), `train_linear_layer_evo2.py` (fit linear classifier) |
| Build training/evaluation datasets | `downstream_tasks/dataset/` | `make_dataset_human.py`, `dataset_human.sh` |
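
The exact definition of the exon-level metrics computed by `metrics.sh` is not given here. As a minimal sketch, assuming the common convention that a predicted exon counts as correct only when both of its boundaries match an annotated exon exactly, precision, recall, and F1 could be computed as follows. The function name `exon_level_metrics` and the `(start, end)` tuple representation are hypothetical and are not taken from the repository.

```python
from typing import Dict, Iterable, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates; hypothetical representation


def exon_level_metrics(predicted: Iterable[Exon], annotated: Iterable[Exon]) -> Dict[str, float]:
    """Exact-match exon precision, recall, and F1 (illustrative only)."""
    pred_set = set(predicted)
    true_set = set(annotated)
    # A predicted exon is a true positive only if both boundaries match exactly.
    true_positives = len(pred_set & true_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: two of four predicted exons match the three annotated exons.
annotated = [(100, 200), (300, 450), (600, 700)]
predicted = [(100, 200), (300, 440), (600, 700), (800, 900)]
metrics = exon_level_metrics(predicted, annotated)
print(metrics)
```

In the toy example, the exon `(300, 440)` is off by ten bases at its end, so under this assumed exact-match rule it counts as both a false positive and a missed annotation.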

---

## Pre-trained Models

| Model | Hugging Face link |
|-------|----------------------------|
| Caduceus-PS | <https://huggingface.co/kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16> |
| Caduceus-PH | <https://huggingface.co/kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16> |
| GENA-LM (base) | <https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t> |
| GENA-LM (large) | <https://huggingface.co/AIRI-Institute/gena-lm-bert-large-t2t> |
| Evo 2 (1B parameters) | <https://huggingface.co/arcinstitute/evo2_1b_base> |
| Segment NT (human) | <https://huggingface.co/InstaDeepAI/segment_nt> |
| Segment NT (multispecies) | <https://huggingface.co/InstaDeepAI/segment_nt_multi_species> |
| Segment Borzoi | <https://huggingface.co/InstaDeepAI/segment_borzoi> |
| Segment Enformer | <https://huggingface.co/InstaDeepAI/segment_enformer> |
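
For scripts that switch between these checkpoints, the table can be mirrored as a small lookup. This is only a convenience sketch: the `HF_MODELS` dictionary, its short keys, and the `hub_url` helper are hypothetical names, while the repository IDs are copied from the table above.

```python
# Hugging Face repository IDs copied from the table above.
# The short keys and the helper name are hypothetical, chosen for illustration.
HF_MODELS = {
    "caduceus-ps": "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16",
    "caduceus-ph": "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16",
    "gena-lm-base": "AIRI-Institute/gena-lm-bert-base-t2t",
    "gena-lm-large": "AIRI-Institute/gena-lm-bert-large-t2t",
    "evo2-1b": "arcinstitute/evo2_1b_base",
    "segment-nt-human": "InstaDeepAI/segment_nt",
    "segment-nt-multispecies": "InstaDeepAI/segment_nt_multi_species",
    "segment-borzoi": "InstaDeepAI/segment_borzoi",
    "segment-enformer": "InstaDeepAI/segment_enformer",
}


def hub_url(short_name: str) -> str:
    """Return the Hugging Face URL for a short model name."""
    if short_name not in HF_MODELS:
        raise KeyError(f"Unknown model {short_name!r}; choose from {sorted(HF_MODELS)}")
    return f"https://huggingface.co/{HF_MODELS[short_name]}"


print(hub_url("gena-lm-base"))
```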

## Tiberius and AUGUSTUS

- **Tiberius** (end‐to‐end CNN + HMM gene prediction) – GitHub:  
  <https://github.com/Gaius-Augustus/Tiberius>

- **AUGUSTUS** (HMM‐based ab initio gene finder) – Web server:  
  <https://bioinf.uni-greifswald.de/augustus/>

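To load a model from the Hugging Face Hub with `transformers` (Caduceus-PS is shown here; some models in the Pre-trained Models table, such as Evo 2, may need their own loading code):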

```python
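from transformers import AutoModel, AutoTokenizer

# Hugging Face repository ID (Caduceus-PS shown as an example).
model_name = "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16"

# trust_remote_code=True allows transformers to run the custom code shipped in the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)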
