# TReconLM

# Installation

The code has been tested on `Ubuntu 22.04.4 LTS`.

Create conda environment: `conda env create -f TReconLM.yml`

To use FlashAttention (see https://github.com/pytorch/pytorch/issues/119054), run the following, adjusting the paths to your setup:

1. `pip install nvidia-cuda-nvcc-cu11`

2. `export TRITON_PTXAS_PATH=/opt/conda/envs/TReconLM/lib/python3.11/site-packages/nvidia/cuda_nvcc/bin/ptxas`

3. `export PYTHONPATH="${PYTHONPATH}:/path/TReconLM/src"`
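A wrong `TRITON_PTXAS_PATH` is easy to miss until a kernel fails to compile, so a quick check before training can help. The following is a minimal sketch and not part of the repository: the helper name `find_ptxas` is hypothetical, and it only verifies that the variable points to an executable file, falling back to any `ptxas` found on `PATH`.

```python
import os
import shutil


def find_ptxas(env_var="TRITON_PTXAS_PATH"):
    """Return a usable ptxas path, or None if none is found.

    Hypothetical helper for illustration; not part of TReconLM.
    """
    candidate = os.environ.get(env_var)
    # Accept the configured path only if it is an executable file.
    if candidate and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Otherwise fall back to whatever ptxas is on PATH, if any.
    return shutil.which("ptxas")


if __name__ == "__main__":
    path = find_ptxas()
    print(path if path else "ptxas not found; check TRITON_PTXAS_PATH")
```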

# Structure

Parameters for the experiments can be found in `src/hydra`, which is organized as follows:

1. `train_config`: contains all parameters for training, including the model definition and training hyperparameters

2. `inference_config`: contains all configurations for running trace reconstruction experiments in `src/eval_pkg/TR.py`

3. `data_config`: config for generating synthetic data (IDS channel) for testing
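For readers unfamiliar with the IDS (insertion, deletion, substitution) channel mentioned above, here is a minimal sketch of how one cluster of noisy traces could be generated. It is an illustration under assumptions, not the repository's generator: the function names, the per-base error model (one event per position, insertion placed before the base), and the error probabilities are all hypothetical.

```python
import random

ALPHABET = "ACGT"


def ids_channel(seq, p_ins, p_del, p_sub, rng):
    """Pass one sequence through a simple IDS channel (illustrative model)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ins:
            # Insertion: emit a random base, then keep the original base.
            out.append(rng.choice(ALPHABET))
            out.append(base)
        elif r < p_ins + p_del:
            # Deletion: drop the base.
            continue
        elif r < p_ins + p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(length, n_traces, p_ins, p_del, p_sub, seed=0):
    """Draw a random ground-truth sequence and n_traces noisy copies of it."""
    rng = random.Random(seed)
    truth = "".join(rng.choice(ALPHABET) for _ in range(length))
    traces = [ids_channel(truth, p_ins, p_del, p_sub, rng) for _ in range(n_traces)]
    return truth, traces


# Assumed example values: L=60, N=5, 5% error rate per error type.
truth, traces = make_cluster(length=60, n_traces=5, p_ins=0.05, p_del=0.05, p_sub=0.05)
print(truth)
for t in traces:
    print(t)
```

The trace reconstruction task is then to recover `truth` from `traces` alone.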

# Training

## Pretraining

In `src`, run `python pretrain.py exps=...`, choosing a pretraining experiment from `src/hydra/config/train_config/exps`.

Example: `python pretrain.py exps=test/pretrain_scratch`

Pretraining experiments: `ids_{60,110,180}nt`, `test`

Pretraining data is generated during the training process. 

Examples for training times:

1. Training a 20M model on 16M instances with sequence length `L=60` and cluster size `N=5` on one RTX A6000 takes approx. 1 day

2. Training a 300M model on 32M instances with sequence length `L=60` and cluster size `N=5` on two H100 GPUs takes approx. 3 days
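To estimate training time for a different number of instances, the two reference points above can be converted into an average throughput. This is a rough back-of-the-envelope sketch: the function name is hypothetical, and it assumes throughput is constant over training and splits evenly across GPUs, which the figures above do not confirm.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def instances_per_second(num_instances, days, num_gpus=1):
    """Average training throughput per GPU, in instances per second."""
    return num_instances / (days * SECONDS_PER_DAY) / num_gpus


# Reference points taken from the approximate training times listed above.
small = instances_per_second(16_000_000, days=1, num_gpus=1)
large = instances_per_second(32_000_000, days=3, num_gpus=2)

print(f"20M model on one RTX A6000: {small:.0f} instances/s per GPU")
print(f"300M model on two H100:     {large:.0f} instances/s per GPU")
```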

## Finetuning

In `src`, run `python finetune.py exps=...`, choosing a finetuning experiment from `src/hydra/config/train_config/exps`.

Finetuning experiments: `microsoft_data`, `noisy_dna`

# Inference

In `src/eval_pkg`, run `python TR.py exps=...`, choosing an experiment from `src/hydra/config/inference_config/exps`.

# Data

## Noisy-DNA dataset 

See `data/noisy_dna/noisyDNA_README.md`.

## Microsoft dataset

See `data/microsoft_data/Microsoft_data_README.md`.

# Baselines 

Implementations of the VS, BMALA, and ITR baselines can be found in Sabary et al. 2020. The code for MUSCLE can be found in Edgar 2004.