# mutation_density
Cancer-specific mutation density prediction project.

## Dependencies:
- pytorch
- gpytorch
- tensorboardX
- tqdm
- argparse
- h5py
- datetime
- sklearn
- numpy 
- pandas
- matplotlib
- scipy
- re

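**Note: to set up the working environment from scratch, create an Anaconda Python 3 environment and pip install any of the libraries above that are missing.** `argparse`, `datetime`, and `re` are part of the Python standard library and need no installation; on pip, `pytorch` is published as `torch` and `sklearn` as `scikit-learn`.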
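
As a quick check of the environment, the sketch below reports which of the third-party packages above cannot be imported. `missing_modules` is a hypothetical helper and not part of this repository, and the import names are assumptions where they differ from the list (`torch` for pytorch).

```python
import importlib.util

# Import names for the third-party dependencies listed above.
# Assumption: pytorch is imported as `torch`. argparse, datetime and re
# ship with Python, so they are not checked here.
REQUIRED = [
    "torch", "gpytorch", "tensorboardX", "tqdm", "h5py",
    "sklearn", "numpy", "pandas", "matplotlib", "scipy",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Any name it prints still needs to be installed into the active environment.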

## Hyperparameters
### Model:
- Cancer IDs: a list of cancer type IDs corresponding to the keys of the input h5 data file (e.g. SNV_skin_melanoma_MELAU_AU, SNV_liver_ALL, etc.). If more than one ID is given, the network will predict a separate feature vector for each cancer type (the cancer types are separated only in the last feature vector), but the best run is selected according to the accuracy on the first cancer ID only.
- Mappability: a lower-bound threshold on region mappability. Only 10kb regions with a mappability score equal to or greater than the threshold are taken into consideration.
- Predictor tracks selection: an optional epigenetic predictor tracks selection file. It allows easy selection of a subset of the input tracks via a file of track indices (for an example see: https://github.com/AdamYaari/mutation_density/blob/master/track_selection_files/skin_tracks.txt).
- Split method: if 'random', choose the train and test sets fully at random; if 'chr', divide each chromosome into train and test regions with no overlap.
- Attention maps: if true, jointly train an attention subnetwork that produces an input-sized attention map (row-wise softmax), which is multiplied element-wise with the input matrix before it is passed on to the network. The attention maps for the best run will be saved as 'attention_maps.h5' under the output directory.
- Run Gaussian process: if true, run a full NN+GP run to produce region-wise predictions and confidence intervals. The GP is trained on the last-layer latent vectors from the best NN run. The GP results for the best run will be saved as 'gp_results.h5' under the output directory.
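
To make the attention-map option concrete, here is a minimal NumPy sketch of an input-sized, row-wise softmax map applied element-wise to a single input region. It is an illustration under assumptions, not the repository's implementation: "row-wise" is taken to mean a softmax over the last axis, the array shape is made up, and random scores stand in for the output of the trained attention subnetwork.

```python
import numpy as np


def row_softmax(scores):
    """Softmax over the last axis, so every row sums to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def apply_attention(x, scores):
    """Weight the input element-wise by an input-sized attention map."""
    attention = row_softmax(scores)
    return attention * x, attention


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))      # hypothetical region: 100 positions x 8 tracks
scores = rng.normal(size=x.shape)  # stand-in for the attention subnetwork output
weighted, attention = apply_attention(x, scores)
```

The weighted matrix has the same shape as the input, so it can be passed to the rest of the network unchanged.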

### Train:
- Held-out ratio: a number between 0 and 1 giving the fraction of the data to set aside as held-out data, which is used only for the final evaluation of the model (default: 0.2).
- Train ratio: a number between 0 and 1 giving the fraction of the data to use for training. The training and test (validation) ratios are applied after the held-out set has been removed (default: 0.8).
- Batch size: number of samples per batch (default: 128).
- Epochs #: number of epochs to train the model (default: 20).
- Reruns #: number of model reruns (i.e. reinitialization+training). All output results will be taken from the best performing run (default: 1).
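
Because the train ratio applies to what remains after the held-out set is removed, the default values give roughly 64% training, 16% test, and 20% held-out data. The sketch below spells out that arithmetic; `split_sizes` is a hypothetical helper, and the rounding rule is an assumption.

```python
def split_sizes(n_samples, heldout_ratio=0.2, train_ratio=0.8):
    """Return (train, test, held-out) sample counts.

    The held-out set is taken first; the train ratio is then applied
    to the remaining samples, and the rest becomes the test set.
    """
    n_heldout = int(round(n_samples * heldout_ratio))  # assumed rounding rule
    n_remaining = n_samples - n_heldout
    n_train = int(round(n_remaining * train_ratio))
    n_test = n_remaining - n_train
    return n_train, n_test, n_heldout


print(split_sizes(10_000))  # (6400, 1600, 2000)
```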

### Files:
- Data file: the path to the h5 input data file. The file should contain the following datasets:
  - Epigenetic predictors data: all of the relevant epigenetic tracks under the key 'x_data', with shape (# samples, region length, # tracks).
  - Mutation counts: all of the relevant cancer mutation counts. Each cancer type should have an independent dataset (1D numpy array) to allow selection by the dataset key.
  - Chromosome regions: the chromosome location of each input region under the key 'idx', given by the trio chromosome #, start index, and end index (tab separated).
  - Mappability values: the mappability values for all regions under the key 'mappability'.
- Output directory: the output directory for the model. Two nested subdirectories will be created under the given output directory: 1) cancer IDs directory (concatenated with a hyphen if multiple cancer types are requested), 2) date and time directory to store multiple runs of the same type.
- Held-out set file: an optional file providing the IDs of specific regions to keep out of the train and test datasets. Performance on the held-out regions is measured after the full training procedure, using only the best model according to test accuracy (the save model parameter must be set to true for this measurement). The file should be a tab-separated txt file with the following headers: `CHROM`, `START`, `END`, `Y_TRUE`, `Y_PRED`, `STD`, `PVAL`, `RANK`.
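
The nested output layout described above can be pictured with the following sketch. `build_output_dir` is a hypothetical helper, and the timestamp format is an assumption; only the hyphen-joined cancer IDs directory and the date-and-time subdirectory come from the description.

```python
import os
from datetime import datetime


def build_output_dir(out_dir, cancer_ids, now=None):
    """Compose <out_dir>/<id1-id2-...>/<timestamp> without creating it."""
    now = now or datetime.now()
    ids_dir = "-".join(cancer_ids)  # multiple cancer IDs joined with a hyphen
    # Assumption: the exact timestamp format used by the project may differ.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return os.path.join(out_dir, ids_dir, stamp)


path = build_output_dir(
    "results",
    ["SNV_skin_melanoma_MELAU_AU", "SNV_liver_ALL"],
    now=datetime(2020, 1, 2, 3, 4, 5),
)
print(path)
```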

### General:
- GPUs: 'all' to run on all visible GPU devices, comma-separated integers to run on multiple specified GPUs, or a single integer to run on one specific GPU (default: 'all').
- Save model: true to save the best performing model (default: false).
- Save training: true to save training performance as a tensorboard file (default: false).
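
The three accepted forms of the GPUs option can be read as in the sketch below. `parse_gpus` is a hypothetical helper, not the project's parser; in practice the number of visible devices would come from something like `torch.cuda.device_count()`.

```python
def parse_gpus(value, n_visible):
    """Map the GPUs option to a list of device indices.

    'all' -> every visible device, '0,2' -> [0, 2], '1' -> [1].
    """
    value = value.strip()
    if value == "all":
        return list(range(n_visible))
    ids = [int(token) for token in value.split(",") if token.strip()]
    for device_id in ids:
        if not 0 <= device_id < n_visible:
            raise ValueError(f"GPU {device_id} is not among {n_visible} visible devices")
    return ids


print(parse_gpus("all", 4))  # [0, 1, 2, 3]
print(parse_gpus("0,2", 4))  # [0, 2]
print(parse_gpus("1", 4))    # [1]
```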

## Neural network training
### Single run usage
```
usage: mutations_main.py [-h] -c [LABEL_IDS [LABEL_IDS ...]] [-d [DATA_FILE]]
                         [-o [OUT_DIR]] [-u [HELDOUT_FILE]] [-t [TRACK_FILE]]
                         [-s [SPLIT_METHOD]] [-m [MAPPABILITY]]
                         [-a [GET_ATTENTION]] [-gp [RUN_GAUSSIAN]]
                         [-r [TRAIN_RATIO]] [-ho [HELDOUT_RATIO]]
                         [-e [EPOCHS]] [-b [BS]] [-re [RERUNS]]
                         [-sm [SAVE_MODEL]] [-st [SAVE_TRAINING]] [-g [GPUS]]

optional arguments:
  -h, --help            show this help message and exit
  -c [LABEL_IDS [LABEL_IDS ...]], --cancer-id [LABEL_IDS [LABEL_IDS ...]]
                        A list of the h5 file mutation count dataset IDs (e.g.
                        SNV_skin_melanoma_MELAU_AU)
  -d [DATA_FILE], --data [DATA_FILE]
                        Path to h5 data file
  -o [OUT_DIR], --out-dir [OUT_DIR]
                        Path to output directory
  -u [HELDOUT_FILE], --held-out [HELDOUT_FILE]
                        Path to file of held-out samples file
  -t [TRACK_FILE], --tracks [TRACK_FILE]
                        Path to predictor tracks selection file
  -s [SPLIT_METHOD], --split [SPLIT_METHOD]
                        Dataset split method (random/chr)
  -m [MAPPABILITY], --mappability [MAPPABILITY]
                        Mappability lower bound
  -a [GET_ATTENTION], --attention [GET_ATTENTION]
                        True: train with attention map training
  -gp [RUN_GAUSSIAN], --gaussian [RUN_GAUSSIAN]
                        True: train gaussian process regression on the best
                        performing model
  -r [TRAIN_RATIO], --train-ratio [TRAIN_RATIO]
                        Train set split size ratio
  -ho [HELDOUT_RATIO], --heldout-ratio [HELDOUT_RATIO]
                        Held-out set split size ratio (will be extracted prior
                        to train validation split)
  -e [EPOCHS], --epochs [EPOCHS]
                        Number of epochs
  -b [BS], --batch [BS]
                        Batch size
  -re [RERUNS], --reruns [RERUNS]
                        Number of model reinitializations and training runs
  -sm [SAVE_MODEL], --save-model [SAVE_MODEL]
                        True: save best model across all reruns
  -st [SAVE_TRAINING], --save-training [SAVE_TRAINING]
                        True: save training process and results to Tensorboard
                        file
  -g [GPUS], --gpus [GPUS]
                        GPUs devices (all/comma separted list)
```
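
As an illustration of how the flags combine, the sketch below assembles a single-run command line and prints it. The flags are taken from the usage text above; the data path, output directory, and rerun count are hypothetical placeholders.

```python
import shlex
import sys

# Hypothetical paths and values; only the flag names come from the usage text.
cmd = [
    sys.executable, "mutations_main.py",
    "-c", "SNV_skin_melanoma_MELAU_AU",  # cancer ID (key in the h5 file)
    "-d", "data/input.h5",               # placeholder data file path
    "-o", "results",                     # placeholder output directory
    "-s", "chr",                         # split each chromosome into train/test
    "-e", "20",                          # epochs
    "-b", "128",                         # batch size
    "-re", "3",                          # reruns; the best run is kept
]

print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```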

### Full k-fold usage:
```
usage: kfold_mutation_main.py [-h] -c [LABEL_IDS [LABEL_IDS ...]]
                             [-d [DATA_FILE]] [-o [OUT_DIR]] [-t [TRACK_FILE]]
                             [-s [SPLIT_METHOD]] [-m [MAPPABILITY]] [-k [K]]
                             [-e [EPOCHS]] [-b [BS]] [-g [GPUS]]
                             [-sm [SAVE_MODEL]] [-st [SAVE_TRAINING]]
                             [-re [RERUNS]]

optional arguments:
  -h, --help            show this help message and exit
  -c [LABEL_IDS [LABEL_IDS ...]], --cancer-id [LABEL_IDS [LABEL_IDS ...]]
                        A list of the h5 file mutation count dataset IDs (e.g.
                        SNV_skin_melanoma_MELAU_AU)
  -d [DATA_FILE], --data [DATA_FILE]
                        Path to h5 data file
  -o [OUT_DIR], --out-dir [OUT_DIR]
                        Path to output directory
  -t [TRACK_FILE], --tracks [TRACK_FILE]
                        Path to predictor tracks selection file
  -s [SPLIT_METHOD], --split [SPLIT_METHOD]
                        Dataset split method (random/chr)
  -m [MAPPABILITY], --mappability [MAPPABILITY]
                        Mappability lower bound
  -gp [RUN_GAUSSIAN], --gaussian [RUN_GAUSSIAN]
                        True: train gaussian process regression on the best
                        performing model
  -k [K]                Number of folds
  -e [EPOCHS], --epochs [EPOCHS]
                        Number of epochs
  -b [BS], --batch [BS]
                        Batch size
  -g [GPUS], --gpus [GPUS]
                        GPUs devices (all/comma separted list)
  -sm [SAVE_MODEL], --save-model [SAVE_MODEL]
                        True: save best model across all reruns
  -st [SAVE_TRAINING], --save-training [SAVE_TRAINING]
                        True: save training process and results to Tensorboard
                        file
  -re [RERUNS], --reruns [RERUNS]
                        Number of model reinitializations and training runs
```

