Identifying key amino acid types that distinguish paralogous proteins using Shapley value based feature subset selection

Pranav Machingal; Rakesh Busi; Nandyala Hemachandra; Petety V. Balaji

23 Jan 2025 (modified: 18 Jun 2025). Submitted to ICML 2025. License: CC BY-NC 4.0

TL;DR: A data-driven method to identify the key amino acid types that distinguish paralogous proteins (proteins that share a common ancestor but have diverged in function) from each other.

Abstract: Paralogous proteins share a common ancestor but have diverged in function. Using established machine learning algorithms, we present a data-driven method to identify the key amino acid types that distinguish the two proteins of a given paralog pair. We use an existing Shapley value based feature subset selection algorithm, SVEA, to identify the amino acid types that are adequate to distinguish pairs of paralogous proteins. We refer to these as the amino acid feature subset ($AFS$). For a paralog pair, say proteins $P$ and $Q$, its $AFS$ is partitioned by protein-wise importance into $AFS(P)$ and $AFS(Q)$ using a linear SVM classifier. To validate the significance of the $AFS$ amino acids, we use multiple domain knowledge based methods: (a) multiple sequence alignment, and/or (b) 3D structure analysis, and/or (c) supporting evidence from the biology literature. The method is computationally cheap, requires relatively little data, and can serve as an initial data-driven step before further hypothesis-driven experimental study of proteins. We demonstrate the results for 15 pairs of paralogous proteins.
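The partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it does not implement SVEA: it assumes the $AFS$ has already been selected, assumes each sequence is featurised by its amino acid composition, and assumes that "protein-wise importance" is read off the sign of the linear SVM weight (positive weight assigns the amino acid to $AFS(P)$, negative to $AFS(Q)$). The function names `composition`, `train_linear_svm`, and `partition_afs`, the hyperparameters, and the toy sequences are all hypothetical.

```python
from collections import Counter


def composition(seq, alphabet):
    """Fraction of the sequence made up of each amino acid type in `alphabet`."""
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(a, 0) / n for a in alphabet]


def train_linear_svm(X, y, lam=0.01, lr=0.5, epochs=500):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    A deliberately small stand-in for a linear SVM solver (assumption:
    any linear SVM trainer could be substituted here).
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin-violating points contribute
                for j in range(d):
                    gw[j] -= yi * xi[j] / len(X)
                gb -= yi / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def partition_afs(seqs_p, seqs_q, afs):
    """Split `afs` into (AFS(P), AFS(Q)) by the sign of the SVM weights.

    Assumption: protein P is labelled +1 and protein Q is labelled -1.
    """
    X = [composition(s, afs) for s in seqs_p + seqs_q]
    y = [1] * len(seqs_p) + [-1] * len(seqs_q)
    w, _ = train_linear_svm(X, y)
    afs_p = [a for a, wj in zip(afs, w) if wj > 0]
    afs_q = [a for a, wj in zip(afs, w) if wj < 0]
    return afs_p, afs_q


# Toy data (hypothetical): P sequences are lysine-rich, Q sequences aspartate-rich.
seqs_p = ["KKKAKGKK", "KAKKGKKA", "KKGKKAKK"]
seqs_q = ["DDDADGDD", "DADDGDDA", "DDGDDADD"]
afs_p, afs_q = partition_afs(seqs_p, seqs_q, afs="KD")
print("AFS(P):", afs_p, "AFS(Q):", afs_q)
```

On real data one would presumably replace the hand-written trainer with a library linear SVM and obtain the `afs` argument from the feature subset selection step; the sign-based split shown here is only one plausible reading of the partition.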

Primary Area: Applications->Chemistry, Physics, and Earth Sciences

Keywords: proteins, paralogs, feature subset selection, classification, Shapley values, SVM

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Submission Number: 10049
