CypST: Improving Cytochrome P450 Substrates Prediction with Fine-Tuned Protein Language Model and Graph Attention Network
Keywords: Molecular graph attention networks, Protein language model, Deep learning, Enzyme substrate prediction
TL;DR: We trained an ESM Transformer model to generate protein representations and a graph attention network to derive molecular representations for predicting cytochrome P450 enzyme substrates
Abstract: Cytochrome P450s (CYP450s) are key enzymes in human xenobiotic metabolism, so accurate CYP450 substrate prediction is critical for drug discovery and chemical toxicology. Recent deep learning approaches that directly leverage extensive protein and chemical information from biological and chemical databases to predict enzyme-substrate interactions have achieved remarkable performance. Here, we present CypST, a deep learning-based model that builds on these methods by using a pre-trained ESM-2 Transformer model to extract detailed CYP450 protein representations and by incorporating our fine-tuned graph attention networks (GATs) for more effective learning on molecular graphs. The GATs treat molecular graphs as sets of nodes and edges, with connectivity enforced by masking the attention weight matrix, yielding a custom attention pattern for each graph. This approach captures key molecular interactions and improves substrate prediction. CypST effectively recognizes substructural interactions, constructing a comprehensive molecular representation through multi-substructural feature extraction. By pre-training on a large-scale database of experimentally determined enzyme-substrate pairs and fine-tuning on 51,753 CYP450 enzyme-substrate and 27,857 CYP450 enzyme-non-substrate pairs, CypST focuses on five major human CYP450 isoforms, achieving 0.861 accuracy and 0.909 AUROC and demonstrating strong generalizability to novel compounds across different CYP450 isoforms.
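To make the masking idea in the abstract concrete, the following is a minimal illustrative sketch (not the authors' released code) of a single-head graph attention layer in which the molecular graph's connectivity is enforced by masking the attention weight matrix, so each atom attends only to its bonded neighbours. All class names, shapes, and hyperparameters here are assumptions introduced for illustration.

```python
# Hypothetical sketch of adjacency-masked graph attention; not CypST's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedGraphAttention(nn.Module):
    """Single-head GAT-style layer with the attention matrix masked by adjacency."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        # Attention score computed from concatenated source/target node features.
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_atoms, in_dim)  node (atom) features
        # adj: (num_atoms, num_atoms)  binary adjacency (bond) matrix, incl. self-loops
        h = self.proj(x)                                        # (N, out_dim)
        n = h.size(0)
        # Build [h_i || h_j] for every node pair.
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1),
             h.unsqueeze(0).expand(n, n, -1)], dim=-1)          # (N, N, 2*out_dim)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))     # (N, N)
        # Mask the attention weights: non-bonded pairs get -inf before softmax,
        # so each molecule induces its own attention pattern.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                 # (N, N)
        return weights @ h                                      # aggregated atom features


# Tiny usage example: a 3-atom molecule where atoms 0-1 and 1-2 are bonded.
if __name__ == "__main__":
    adj = torch.tensor([[1., 1., 0.],
                        [1., 1., 1.],
                        [0., 1., 1.]])
    x = torch.randn(3, 8)
    layer = MaskedGraphAttention(8, 16)
    print(layer(x, adj).shape)  # torch.Size([3, 16])
```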
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4496