LEARNING LIGHTWEIGHT STRUCTURE-AWARE EMBEDDINGS FOR PROTEIN SEQUENCES

01 Mar 2023 (modified: 31 May 2023)
Submitted to Tiny Papers @ ICLR 2023
Readers: Everyone
Keywords: Machine Learning, Bioinformatics, Embeddings
TL;DR: We created and evaluated lightweight structure-aware embeddings using the S4PRED model and the ESMFold model as oracles.
Abstract: Machine learning models such as AlphaFold have recently demonstrated remarkable accuracy in predicting protein structures from sequences. This capability enables their use as oracles that provide structure-based information to aid other learning tasks. In this study, we investigate deep learning embeddings and explore the feasibility of developing a structure-aware protein sequence embedding. To accomplish this, we employ S4PRED and ESMFold, two models that predict protein secondary (2D) and tertiary (3D) structures, respectively, directly from single sequences. These models act as oracles for forming structure-aware embeddings through an autoencoder. We then compare this approach with purely sequence-based embeddings on a Protein-Protein Interaction (PPI) prediction task. Our findings highlight the potential advantages of structure-aware embeddings and provide grounds for future research directions.