Learning temperature-aware representations from millions of annotated protein sequences

Published: 11 Oct 2024, Last Modified: 02 Nov 2024. NeurIPS 2024 Workshop FM4Science (Oral). License: CC BY 4.0
Keywords: Protein, Pre-trained Protein Models, Protein Temperature Prediction
TL;DR: This work introduces ThermoFormer, a protein language model pre-trained on 96 million protein sequences annotated with optimal growth temperature.
Abstract: Temperature plays a dominant environmental role in determining the efficiency of protein function. Accurately predicting the thermal stability of proteins is crucial for fundamental biology, drug discovery, and protein engineering. Here, we introduce ThermoFormer, a transformer-based protein language model that learns both temperature-aware representations and sequence patterns. Specifically, we first build a large-scale dataset comprising more than 96 million protein sequences annotated with their optimal growth temperature (OGT). ThermoFormer is pre-trained on this dataset with a supervised OGT prediction task and an unsupervised masked language modeling (MLM) task. We evaluate ThermoFormer both on the pre-training task and on transfer to other temperature prediction datasets, including two melting temperature (TM) datasets and an optimal catalytic temperature (OCT) dataset. The results show that ThermoFormer achieves state-of-the-art performance on the OGT, TM, and OCT prediction tasks, outperforming previous unsupervised pre-trained models. In addition, we show that ThermoFormer enables zero-shot temperature prediction: even without further fine-tuning, it achieves comparable performance. We believe that ThermoFormer can serve as a foundation model for encoding protein sequences with temperature-aware representations, providing better transferability for temperature-related downstream tasks. The datasets, model weights, and source code are available at https://github.com/ginnm/ThermoFormer.
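The abstract describes pre-training with two objectives at once: supervised OGT prediction and unsupervised masked language modeling. Below is a minimal sketch (not the authors' code) of what such a joint objective could look like in PyTorch; the toy encoder, head names, dimensions, masking scheme, and the loss weight `alpha` are all illustrative assumptions rather than details from the paper.

```python
# Sketch of a joint MLM + OGT-regression objective, assuming a small transformer
# encoder as a stand-in for ThermoFormer. All hyperparameters are placeholders.
import torch
import torch.nn as nn

class ToyThermoEncoder(nn.Module):
    """Toy transformer encoder over amino-acid tokens with an MLM head and an OGT head."""
    def __init__(self, vocab_size=33, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)   # predicts residue identities
        self.ogt_head = nn.Linear(d_model, 1)            # regresses OGT from mean-pooled states

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))             # (batch, length, d_model)
        return self.mlm_head(h), self.ogt_head(h.mean(dim=1)).squeeze(-1)

def joint_loss(mlm_logits, mlm_targets, ogt_pred, ogt_true, alpha=1.0):
    # Cross-entropy on masked positions (targets set to -100 elsewhere) plus an
    # OGT regression term; `alpha` balancing the two terms is an assumption.
    mlm = nn.functional.cross_entropy(
        mlm_logits.transpose(1, 2), mlm_targets, ignore_index=-100
    )
    ogt = nn.functional.mse_loss(ogt_pred, ogt_true)
    return mlm + alpha * ogt

# Usage with random data, just to show the shapes involved.
model = ToyThermoEncoder()
tokens = torch.randint(0, 33, (8, 64))                   # batch of 8 sequences, length 64
mlm_targets = torch.full((8, 64), -100)
mlm_targets[:, ::7] = tokens[:, ::7]                     # pretend every 7th residue is masked
ogt_true = torch.rand(8) * 80 + 10                       # optimal growth temperatures in °C
mlm_logits, ogt_pred = model(tokens)
joint_loss(mlm_logits, mlm_targets, ogt_pred, ogt_true).backward()
```

In practice, masked input positions would also be replaced by a mask token before encoding; the sketch omits that step to keep the focus on how the supervised and unsupervised losses combine.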
Submission Number: 18