# Data preparation for the Noisy-DNA dataset

The raw data can be obtained from Antkowiak et al. 2020.

0. run `get_indices.py` -> `indices.txt`

1. run `index_clustering.py` -> `index_clusters.csv`

2. run `filter_index_clusters.py` -> `train_val_data_{SC,LC}.txt` and `test_reads_{SC,LC}.txt`

3. run `train_val_split.py` -> `train_data_{SC,LC}.txt` and `val_data_{SC,LC}.txt`

4. run `./starcode --print-clusters -d6 -s -i data/test_reads_{SC,LC}.txt -o data/starcode_clusters_{SC,LC}.txt`
    (run once per subset, substituting `SC` or `LC` for `{SC,LC}`; starcode can be obtained from Zorita et al. 2015)

5. run `starcode_to_cred.py` -> `starcode_test_cpred_data_{SC,LC}.txt`

6. run `filter_sequences.py`

7. run `check_duplicates.py`
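
The `{SC,LC}` pattern in step 4 stands for two separate files, so starcode has to be invoked once per subset. Pasting the pattern into a shell as-is would brace-expand it into two input paths and two output paths in a single call. The sketch below shows one way to build and run both invocations from Python. The helper names `starcode_command` and `run_starcode` are hypothetical and not part of this repository, and it assumes the `starcode` binary sits in the current directory with the reads under `data/`, as in the command above.

```python
import subprocess

# The two subsets named in the file patterns above.
SUBSETS = ("SC", "LC")


def starcode_command(subset, starcode="./starcode", data_dir="data"):
    """Build the step 4 command line for a single subset (hypothetical helper)."""
    return [
        starcode,
        "--print-clusters",  # also list the sequences belonging to each cluster
        "-d6",               # maximum Levenshtein distance of 6
        "-s",                # sphere clustering
        "-i", f"{data_dir}/test_reads_{subset}.txt",
        "-o", f"{data_dir}/starcode_clusters_{subset}.txt",
    ]


def run_starcode(subsets=SUBSETS):
    """Run starcode once per subset; raises CalledProcessError if a run fails."""
    for subset in subsets:
        subprocess.run(starcode_command(subset), check=True)
```

Calling `run_starcode()` after step 3 should produce `data/starcode_clusters_SC.txt` and `data/starcode_clusters_LC.txt`, which step 5 then consumes.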
