Identifying key amino acid types that distinguish paralogous proteins using Shapley value based feature subset selection
Abstract: Paralogous proteins share a common ancestor but have diverged in function. Using established machine learning algorithms, we present a data-driven method to identify the key amino acid types that distinguish a given pair of paralogous proteins. We use an existing Shapley value based feature subset selection algorithm, SVEA, to identify the amino acid types that are adequate to distinguish the two proteins of a pair; we refer to this set as the amino acid feature subset ($AFS$). For a paralog pair, say proteins $P$ and $Q$, the $AFS$ is partitioned by protein-wise importance into $AFS(P)$ and $AFS(Q)$ using a linear classifier (SVM). To validate the significance of the $AFS$ amino acids, we use one or more domain knowledge based methods: (a) multiple sequence alignment, (b) 3D structure analysis, and (c) supporting evidence from the biology literature. The method is computationally cheap, requires relatively little data, and can serve as an initial data-driven step before further hypothesis-driven experimental study of proteins. We demonstrate results for 15 pairs of paralogous proteins.
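The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the SVEA algorithm and not the authors' implementation: it substitutes a generic permutation-sampling Shapley estimate, uses the training accuracy of a nearest-centroid classifier as a hypothetical value function, and replaces the linear SVM partitioning step with a simple comparison of class-wise mean composition. The amino acid composition features, the toy sequences, and all function names are assumptions made purely for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types


def composition(seq):
    """Fraction of each amino acid type in a sequence (assumed feature encoding)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def centroid_accuracy(X, y, subset):
    """Hypothetical value function: training accuracy of a nearest-centroid
    classifier restricted to the feature indices in `subset`."""
    if not subset:
        return 0.5  # chance level for a balanced two-class problem
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    correct = 0
    for x, t in zip(X, y):
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(subset, cents[lab]))
                for lab in (0, 1)}
        pred = 0 if dist[0] <= dist[1] else 1  # ties go to class 0
        correct += pred == t
    return correct / len(X)


def shapley_estimates(X, y, n_perm=100, seed=0):
    """Permutation-sampling estimate of each feature's Shapley value
    (a generic estimator, standing in for SVEA)."""
    rng = random.Random(seed)
    m = len(X[0])
    phi = [0.0] * m
    for _ in range(n_perm):
        order = list(range(m))
        rng.shuffle(order)
        subset, prev = [], centroid_accuracy(X, y, [])
        for j in order:
            subset.append(j)
            cur = centroid_accuracy(X, y, subset)
            phi[j] += (cur - prev) / n_perm  # marginal contribution of feature j
            prev = cur
    return phi


# Toy data (made up): "P"-like sequences are enriched in K, "Q"-like in E.
BASE = AMINO_ACIDS * 2
seqs_p = [BASE + "K" * (8 + i) for i in range(5)]
seqs_q = [BASE + "E" * (8 + i) for i in range(5)]
X = [composition(s) for s in seqs_p + seqs_q]
y = [0] * 5 + [1] * 5

phi = shapley_estimates(X, y)

# Hypothetical selection rule: keep features with a positive estimated value.
afs = [AMINO_ACIDS[j] for j in range(len(AMINO_ACIDS)) if phi[j] > 0]


def class_mean(label, j):
    """Mean composition of feature j over the sequences of one class."""
    rows = [x for x, t in zip(X, y) if t == label]
    return sum(r[j] for r in rows) / len(rows)


# Stand-in for the linear SVM step: assign each selected amino acid to the
# protein whose sequences carry more of it on average.
afs_p = [a for a in afs
         if class_mean(0, AMINO_ACIDS.index(a)) > class_mean(1, AMINO_ACIDS.index(a))]
afs_q = [a for a in afs if a not in afs_p]

print("AFS:", afs, "AFS(P):", afs_p, "AFS(Q):", afs_q)
```

In this toy setting only the two enriched amino acid types carry any discriminative signal, so they are the only ones selected. With real paralog sequences the value function, the selection threshold, and the partitioning rule would all need to follow the method described in the paper rather than these stand-ins.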
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: One of the six subfigures in the appendix (section E, Figure 7 (a)), which was incorrectly pasted, has been corrected.
Assigned Action Editor: ~Anastasios_Kyrillidis2
Submission Number: 4968