Keywords: amino-acid compositions, anticancer peptide prediction, cross-attention, protein language models
TL;DR: The paper integrates amino acid sequence-based features into protein language models to improve the prediction of anticancer peptides.
Abstract: In the fight against cancer, anticancer peptides (ACPs) hold promising therapeutic potential due to their selective cytotoxicity and lower side effects compared to traditional treatments. However, identifying novel ACPs is costly and labor-intensive. Protein language models (PLMs), such as ESM-2 and ProtBERT, have revolutionized peptide prediction by capturing complex biological patterns through pre-training on vast datasets. Yet they often struggle to accurately model specific biochemical interactions. To address this limitation, we integrate four sequence-based features into PLMs through a cross-attention mechanism: amino acid composition (AAC), dipeptide composition (DPC), composition of k-spaced amino acid group pairs (CKS), and k-mer sparse matrix (k-mer). These features supply biochemical information that PLMs alone may overlook, enabling a more detailed prediction of anticancer properties. The integration improves prediction accuracy by 15.8% for ProtBERT and 2.9% for ESM-2, with ESM-2 achieving the highest accuracy at 77.8%. SHapley Additive exPlanations (SHAP) analysis confirms the importance of these features, demonstrating that incorporating amino acid features into PLMs enhances ACP prediction.
Track: 2. Large Language Models for biomedical and clinical research
Registration Id: DCNWLGDJHGJ
Submission Number: 99
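For readers unfamiliar with the composition features named in the abstract, the following is a minimal sketch of how AAC and DPC vectors are commonly computed from a peptide sequence. It is an illustration under assumptions, not the authors' implementation: the function names `aac` and `dpc`, the residue ordering in `AMINO_ACIDS`, and the handling of non-standard residues (counted in the denominator but given no feature slot) are all hypothetical choices made here.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: fraction of each adjacent residue pair (400 values)."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping adjacent pairs
    pairs = Counter(seq[i:i + 2] for i in range(total))
    return [
        pairs[a + b] / total if total > 0 else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]
```

Both functions return fixed-length vectors regardless of peptide length, which is what makes such features convenient to combine with PLM embeddings; how the paper projects and fuses them through cross-attention is not specified in this abstract.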