1/28/26

ABOUT:
======
This pipeline processes the Pfam-A.seed file into inputs for training.
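
For orientation, Pfam-A.seed is a Stockholm-format file in which each family record ends with a `//` line and carries its accession on a `#=GF AC` line. The sketch below shows one way to iterate over such records using only the standard library. It is an illustration of the file layout, not the code in clean_data.py; the function name `iter_stockholm_families` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_stockholm_families(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (accession, {sequence_name: aligned_sequence}) for each family.

    Hypothetical helper: assumes standard Stockholm markup, where
    '#=GF AC' holds the accession and '//' terminates a record.
    """
    accession = ""
    seqs: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#=GF AC"):
            # Example line: "#=GF AC   PF00001.24"
            accession = line.split()[2]
        elif line.startswith("//"):
            # End of one family record; emit it and reset the state.
            if seqs:
                yield accession, seqs
            accession, seqs = "", {}
        elif line and not line.startswith("#"):
            parts = line.split()
            if len(parts) >= 2:
                name, aligned = parts[0], parts[1]
                # Concatenate in case an alignment is split into blocks.
                seqs[name] = seqs.get(name, "") + aligned


# Usage on a tiny in-memory example (made-up data, not real Pfam content).
example = [
    "# STOCKHOLM 1.0\n",
    "#=GF ID   FamOne\n",
    "#=GF AC   PF00001.1\n",
    "seqA/1-3   AC-D\n",
    "seqB/1-4   ACGD\n",
    "//\n",
    "# STOCKHOLM 1.0\n",
    "#=GF AC   PF00002.1\n",
    "seqC/1-2   KL\n",
    "//\n",
]
families = list(iter_stockholm_families(example))
```

To read from disk, pass an open file handle instead of the list, for example the EXAMPLE_Pfam-A.seed file from the example inputs.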



REQUIREMENTS:
==============

Database, Software Versions
-----------------------------
- Pfam-A.seed, downloaded via FTP from [Pfam v36.0](https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam36.0/)
- FastTree version 2.1.11 (No SSE3 build), used to impute missing trees

Python packages
----------------
- Python 3.9.18, standard packages that come with a miniconda installation
- JAX 0.4.28: for precomputing transition and emission counts (for pairHMM inputs)
- Biopython 1.81: for processing phylogenetic trees
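
As a rough illustration of what "transition and emission counts" can mean for a pairHMM, the sketch below tallies them from a single pairwise alignment. It is written in plain Python rather than JAX, and it is an assumption about the general idea, not the pipeline's actual implementation. The function name `pair_counts` and the state labels M (match), I (insert) and D (delete) are hypothetical.

```python
from collections import Counter
from typing import Tuple


def pair_counts(anc: str, desc: str) -> Tuple[Counter, Counter, Counter, Counter]:
    """Count emissions and state transitions for one aligned sequence pair.

    Hypothetical helper. Columns are labeled:
      M: residue in both sequences
      I: gap in `anc`, residue in `desc`
      D: residue in `anc`, gap in `desc`
    Columns that are gaps in both sequences are skipped.
    """
    if len(anc) != len(desc):
        raise ValueError("aligned sequences must have equal length")

    match_emit: Counter = Counter()  # (anc_residue, desc_residue) -> count
    ins_emit: Counter = Counter()    # desc_residue -> count
    del_emit: Counter = Counter()    # anc_residue -> count
    trans: Counter = Counter()       # (from_state, to_state) -> count

    prev = None
    for a, d in zip(anc, desc):
        if a == "-" and d == "-":
            continue
        if a != "-" and d != "-":
            state = "M"
            match_emit[(a, d)] += 1
        elif a == "-":
            state = "I"
            ins_emit[d] += 1
        else:
            state = "D"
            del_emit[a] += 1
        if prev is not None:
            trans[(prev, state)] += 1
        prev = state
    return match_emit, ins_emit, del_emit, trans


# Usage on a made-up pair: the columns are labeled M, D, I, M, M.
match_emit, ins_emit, del_emit, trans = pair_counts("AC-GT", "A-TGT")
```

This version ignores start and end states; a real pairHMM setup may count those as well.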



To run the pipeline on example data, unzip EXAMPLE_INPUTS.zip and run:
=======================================================================
	python clean_data.py \
		-pfam_seed_file EXAMPLE_INPUTS/EXAMPLE_Pfam-A.seed \
		-tree_dir EXAMPLE_INPUTS/trees/ \
		-num_splits 2 \
		-topk1_valid 0 \
		-topk2_valid 0
