Identifying key amino acid types that distinguish paralogous proteins using Shapley value based feature subset selection
Abstract: Paralogous proteins share a common ancestor but have diverged in function. Using established machine learning algorithms, we present a data-driven method to identify the key amino acid types that distinguish a given pair of paralogous proteins. We use an existing Shapley value based feature subset selection algorithm, SVEA, to identify the amino acid types that are adequate to distinguish the two proteins in a pair. We refer to these as the amino acid feature subset ($AFS$). For a paralog pair, say proteins $P$ and $Q$, the $AFS$ is partitioned by protein-wise importance into $AFS(P)$ and $AFS(Q)$ using a linear classifier, SVM. To validate the significance of the $AFS$ amino acids, we use multiple domain knowledge based methods: (a) multiple sequence alignment, and/or (b) 3D structure analysis, and/or (c) supporting evidence from the biology literature. The method is computationally cheap, requires relatively little data, and can serve as an initial data-driven step for further hypothesis-driven experimental study of proteins. We demonstrate the results for 15 pairs of paralogous proteins.
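The pipeline described in the abstract can be illustrated with a minimal sketch. It is not SVEA and not the authors' code. It assumes amino acid composition as the feature encoding, uses a nearest-centroid rule as the value function, estimates Shapley values by Monte Carlo permutation sampling, keeps features with a positive estimate, and replaces the linear SVM partitioning step with a simple comparison of class means. The toy sequences and all function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types

def composition(seq):
    """Fraction of each amino acid type in a sequence (assumed encoding)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def class_means(X, y):
    """Mean feature vector for label 0 (protein P) and label 1 (protein Q)."""
    means = []
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        means.append([sum(col) / len(rows) for col in zip(*rows)])
    return means

def accuracy(X, y, subset, means):
    """Value function: training accuracy of a nearest-centroid rule that
    only looks at the features in `subset` (ties go to label 0)."""
    correct = 0
    for x, t in zip(X, y):
        d0 = sum((x[j] - means[0][j]) ** 2 for j in subset)
        d1 = sum((x[j] - means[1][j]) ** 2 for j in subset)
        correct += int((1 if d1 < d0 else 0) == t)
    return correct / len(y)

def shapley_estimates(X, y, n_perm=200, seed=0):
    """Monte Carlo Shapley estimates: average marginal gain in accuracy
    when a feature is added, over random feature orderings."""
    rng = random.Random(seed)
    means = class_means(X, y)
    n = len(X[0])
    phi = [0.0] * n
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        subset, prev = [], accuracy(X, y, [], means)
        for j in order:
            subset.append(j)
            cur = accuracy(X, y, subset, means)
            phi[j] += (cur - prev) / n_perm
            prev = cur
    return phi

# Hypothetical toy data: P is lysine-rich, Q is glutamate-rich.
p_seqs = ["MKKLAKKGA", "MKALKKKGG", "MKKKLAGKA"]
q_seqs = ["MEELAEEGA", "MEALEEEGG", "MEEELAGEA"]
X = [composition(s) for s in p_seqs + q_seqs]
y = [0] * len(p_seqs) + [1] * len(q_seqs)

phi = shapley_estimates(X, y)
# Assumed selection rule: keep amino acid types with a positive estimate.
afs = [a for a, v in zip(AMINO_ACIDS, phi) if v > 0]

# Stand-in for the linear SVM step: assign each selected type to the
# protein whose sequences contain it at the higher mean fraction.
means = class_means(X, y)
afs_p = [a for a in afs if means[0][AMINO_ACIDS.index(a)] > means[1][AMINO_ACIDS.index(a)]]
afs_q = [a for a in afs if a not in afs_p]
print(afs, afs_p, afs_q)
```

On this toy data only K and E change the classification accuracy, so they are the only types with nonzero Shapley estimates, and the estimates sum to the accuracy gained over the empty feature set. Real paralog data would need a held-out evaluation and the selection criterion defined by SVEA rather than the positive-value rule assumed here.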
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The manuscript is revised with the following changes (highlighted in cyan):
- Research question added in intro para 2.
- New Appx. Sec. E.4 Variance in Shapley value estimates and AFS, page 32. Includes Table 7 and Fig. 13.
- Summary of Appx. Sec. E.4 added in Sec. 2.2, page 4.
- New Appx. Sec. E.5 AFS comparison with random feature subsets, page 34. Includes Table 8 and Fig. 14.
- Summary of Appx. Sec. E.5 added in new Sec. 3.2.1, page 11.
- Appx. Sec. E.3 is updated with details on MCI.
Other minor editorial corrections were made (not highlighted).
The code repo has also been updated with the code for reproducing the added computations.
Assigned Action Editor: ~Anastasios_Kyrillidis2
Submission Number: 4968