Enhancing Protein Language Models with Feature Integration for Anticancer Peptide Prediction

Tiara Natasha Binte Sayuti; Shen Cheng; Santhisenan Ajith; Abdul Hadi Bin Abdul Samad; Jagath Rajapakse

Published: 25 Sept 2024, Last Modified: 21 Oct 2024, Venue: IEEE BHI'24, License: CC BY 4.0

Keywords: amino-acid compositions, anticancer peptide prediction, cross-attention, protein language models

TL;DR: The paper introduces amino acid sequence-based features to protein language models to enhance prediction of anticancer peptides

Abstract: Anticancer peptides (ACPs) hold promising therapeutic potential in the fight against cancer because of their selective cytotoxicity and lower side effects compared with traditional treatments. However, identifying novel ACPs is costly and labor-intensive. Protein language models (PLMs), such as ESM-2 and ProtBERT, have revolutionized peptide prediction by leveraging vast datasets to capture complex biological patterns through pre-training, but they often struggle to model specific biochemical interactions accurately. To address this limitation, we integrate four sequence-based features into PLMs through a cross-attention mechanism: amino acid composition (AAC), dipeptide composition (DPC), composition of k-spaced amino acid group pairs (CKS), and k-mer sparse matrix (k-mer). These features supply biochemical information that a PLM alone may overlook, enabling more detailed prediction of anticancer properties. The integration improves prediction accuracy by 15.8% for ProtBERT and 2.9% for ESM-2, with ESM-2 achieving the highest accuracy at 77.8%. SHapley Additive exPlanations (SHAP) analysis confirms the importance of these features, demonstrating that incorporating amino acid features into PLMs enhances ACP prediction.
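
To make two of the features named in the abstract concrete, the following is a minimal sketch of AAC and DPC using their commonly cited definitions: per-residue frequency over the sequence length, and per-dipeptide frequency over the number of adjacent pairs. It is an illustration only, not the authors' implementation. The function names `aac` and `dpc`, the constant names, and the treatment of non-standard residues are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in the same fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(sequence):
    """Amino acid composition: a 20-dimensional frequency vector.

    Assumption: non-standard residues (e.g. 'X') count toward the
    sequence length but get no entry, so the vector may sum to < 1.
    """
    seq = sequence.upper()
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: a 400-dimensional frequency vector
    over adjacent residue pairs (len(sequence) - 1 pairs in total)."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [counts[d] / len(pairs) for d in DIPEPTIDES]


if __name__ == "__main__":
    peptide = "ACAC"  # toy sequence, not taken from the paper's dataset
    print(aac(peptide)[:2])                      # frequencies of A and C
    print(dpc(peptide)[DIPEPTIDES.index("AC")])  # frequency of dipeptide AC
```

Vectors like these have a fixed length regardless of peptide length, which is what makes them convenient to combine with PLM embeddings. How the paper projects and attends over them is not specified in the abstract.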

Track: 2. Large Language Models for biomedical and clinical research

Registration Id: DCNWLGDJHGJ

Submission Number: 99
