Protein structural superfamily classification using hand-crafted and language model features: A performance vs interpretability trade-off
Abstract: The CATH database categorizes more than 600,000 protein domain structures into superfamilies according to a hierarchy of structural similarity criteria. Members of a single superfamily may share less than 35% sequence similarity. The scale of such data motivates the use of machine learning methods that can accurately predict the CATH superfamily of a protein domain and, at the same time, are interpretable, i.e. provide insights into the characteristic features of a superfamily. The recent rise of protein language models (PLMs) that leverage large-scale data and compute has introduced an interesting conflict: a trade-off between the high predictive performance of non-interpretable features and the scientific insight that can be gained from interpretable, hand-crafted ones. In this work, we highlight and study this conflict via the task of classifying protein domains into their CATH superfamilies. We train one-vs-all (OvA) linear SVM classifiers for 45 diverse CATH superfamilies, each characterised by significant class imbalance. We address the class imbalance by using a class-balanced loss function and by evaluating with the arithmetic mean (AM) of specificity and sensitivity. Our analysis compares nine feature vector types, which are either non-interpretable embeddings from PLMs or interpretable hand-crafted features. The latter include amino acid composition (AAC), di- and tri-peptide composition (DPC, TPC), and novel sequence-order (2OAAC, 3OAAC) and structure-based features (OCPC, CSIC). Our results demonstrate that PLM-based features achieve superior test AM scores of 90-99% with low variability, outperforming hand-crafted features by 20-30%. While PLM features yield high classification accuracy, their lack of interpretability obscures the underlying biological determinants. Conversely, the interpretability of hand-crafted features, despite their relatively low performance, can be leveraged to infer sequence and structural characteristics of CATH superfamilies.
We illustrate this for two superfamilies. First, we rank the components of hand-crafted features using a known method, marginal contribution feature importance (MCI). Then, based on the interpretability of the top-ranked hand-crafted feature components, we derive biological insights, such as characteristic contacts of superfamily structures. The proposed hand-crafted CSIC feature strikes a balance between predictive performance and interpretability, as it overfits less while providing rich structural information about contact sequence separation. This can be valuable for downstream applications, such as investigating protein-related diseases and guiding rational protein design.
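To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of composition features (AAC for k = 1, DPC for k = 2, TPC for k = 3) and of the AM evaluation score. It is an illustration under stated assumptions, not the authors' implementation: the function names `kmer_composition` and `arithmetic_mean_score` are hypothetical, the features are assumed to be relative frequencies over the 20 standard residues, and labels are assumed to be 1 for the target superfamily and 0 otherwise.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so that feature indices are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(seq, k):
    """Relative frequency of every k-mer over the standard residues.

    k=1 corresponds to AAC (20 values), k=2 to DPC (400), k=3 to TPC (8000).
    Assumption: windows containing non-standard residues are skipped.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def arithmetic_mean_score(y_true, y_pred):
    """Arithmetic mean of sensitivity and specificity for binary labels (1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)


if __name__ == "__main__":
    toy_domain = "ACCA"  # toy sequence, not a real CATH domain
    print(kmer_composition(toy_domain, 1)[:2])
    print(arithmetic_mean_score([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Because the AM score weights sensitivity and specificity equally, a one-vs-all classifier that always predicts the majority (negative) class scores only 0.5, which is why it is a reasonable choice under the class imbalance described above. Each entry of a composition vector maps back to a named residue or peptide, which is the sense in which such features are interpretable.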
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Anastasios_Kyrillidis2
Submission Number: 7514