contrastive_rna_representation
================

``` sh
mamba env create -f environment.yml
cd contrastive_rna_representation
pip install -e .
```

## Technical files and their functionality

#### bpnet_dilated_conv.py
This module defines `DilatedConv1DBasic`, the basic building block of the DilatedConvNet.
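
For readers unfamiliar with dilation, the following is a minimal pure-Python sketch of what a dilated 1D convolution computes and how stacking dilated layers widens the receptive field. It is an illustration only, not the code in `bpnet_dilated_conv.py`; the function names `dilated_conv1d` and `receptive_field` are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' dilated 1D convolution (cross-correlation, as in deep learning libraries).

    Hypothetical helper for illustration; not part of this repository.
    """
    # Number of input positions covered by one output position.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` positions apart.
        out.append(sum(w * signal[start + k * dilation] for k, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated conv layers with stride 1."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


if __name__ == "__main__":
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=1))
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))
    print(receptive_field(3, [1, 2, 4, 8]))
```

Doubling the dilation at each layer lets the receptive field grow roughly exponentially with depth while the parameter count grows only linearly, which is the usual motivation for dilated stacks on long sequences.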

#### contrastive_model.py
This module runs the contrastive training. It initializes `ContrastiveModel` along with the corresponding trainer and losses; either the DCL loss or the NT-Xent loss can be selected. The script is launched from `slurm/array_train_contrast.sh`. Before running it, construct the dataset at the location given by `dataset_path` using `write_contrastive_tf_record_dataset.py`.
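
To make the difference between the two losses concrete, here is a minimal pure-Python sketch. It assumes the common formulations: NT-Xent normalizes the positive-pair similarity against all other samples in the batch, and DCL (decoupled contrastive learning) additionally drops the positive pair from the denominator. The function `contrastive_loss` and its parameters are hypothetical and are not the implementation in `contrastive_model.py`.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(view_a, view_b, temperature=0.1, decoupled=False):
    """NT-Xent loss (decoupled=False) or DCL-style loss (decoupled=True).

    view_a[i] and view_b[i] are embeddings of two views of the same sample.
    Hypothetical sketch; requires a batch of at least two samples.
    """
    n = len(view_a)
    z = [_normalize(v) for v in view_a + view_b]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        sims = [
            sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
        ]
        # NT-Xent excludes only the anchor itself from the denominator;
        # the decoupled variant also excludes the positive pair.
        excluded = {i, pos} if decoupled else {i}
        denom = sum(math.exp(sims[k]) for k in range(2 * n) if k not in excluded)
        total += -sims[pos] + math.log(denom)
    return total / (2 * n)


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0, 0.0], [0.0, 1.0]]
    print("NT-Xent:", contrastive_loss(a, b))
    print("DCL:    ", contrastive_loss(a, b, decoupled=True))
```

Because the NT-Xent denominator contains the positive term, that loss is always non-negative, whereas the decoupled variant can go below zero when positives are well aligned.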

#### data.py
This module defines the basic data objects required to construct the dataset: the `RefseqDataset`, `Transcript`, and `Interval` classes, which describe the genomic intervals that form the introns and exons making up the dataset.
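
As a rough illustration of how interval objects can describe exons and introns, the sketch below derives introns as the gaps between consecutive exons. The classes `ExampleInterval` and `ExampleTranscript`, their fields, and the 0-based half-open coordinate convention are all assumptions made for this example; the actual `Interval` and `Transcript` classes in `data.py` may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExampleInterval:
    """Hypothetical genomic interval: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int

    def __len__(self):
        return self.end - self.start


@dataclass
class ExampleTranscript:
    """Hypothetical transcript described by its exon intervals."""
    name: str
    strand: str
    exons: List[ExampleInterval]

    def introns(self):
        """Introns are the gaps between consecutive exons in genomic order."""
        ordered = sorted(self.exons, key=lambda e: e.start)
        return [
            ExampleInterval(a.chrom, a.end, b.start)
            for a, b in zip(ordered, ordered[1:])
            if b.start > a.end
        ]

    def spliced_length(self):
        """Total length of the exons, i.e. the mature transcript length."""
        return sum(len(e) for e in self.exons)


if __name__ == "__main__":
    tx = ExampleTranscript(
        name="tx1",
        strand="+",
        exons=[ExampleInterval("chr1", 100, 200), ExampleInterval("chr1", 300, 450)],
    )
    print(tx.introns())
    print(tx.spliced_length())
```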

#### gene_dataset.py
This module defines the functions that construct the contrastive datasets used for training. They take lists of annotated transcripts obtained from the UCSC Table Browser; these files give the coordinates of transcripts in the reference genome. Fundamental classes such as `Transcript` are then used to build the corresponding genomic objects and query their coordinates.

#### go_train.py
This module is for training and fine-tuning dilated ResNet models on the GO prediction tasks.

#### resnet.py
This module instantiates the ResNet models with dilation for RNA half-life prediction.

#### rna_half_life_trainer.py
This module is used for training and fine-tuning dilated ResNet models for RNA half-life prediction, the task used in the Saluki publication.

#### saluki_dataset.py
This module comes from the Saluki publication. It defines the dataset used to load the RNA half-life data from TFRecords.

#### saluki_layers.py
This module defines the Saluki model and the extra layers used in its definition, including the shift and scale layers.

#### util.py
This module collects miscellaneous helper functions used across different parts of the codebase. Anything that does not fit elsewhere goes here.

#### write_contrastive_tf_record_dataset.py
This module creates the TFRecord dataset used for contrastive learning. It requires the FASTA files for the corresponding genomes, the transcriptome annotation files giving transcript locations and exon boundaries, and any additional homology files used to create the mapping within and between species.
