ProtSent: Protein Sentence Transformers
Keywords: Proteins, Sentence Transformers, ESM, Protein Language Models, pLM, sentencebert
TL;DR: Protein sentence transformer models trained on five protein-pair datasets and evaluated on 23 downstream tasks
Abstract: Protein language models produce representations that capture evolutionary and structural information, yet their sequence-level embeddings are not trained to reflect biological similarity between proteins. We present ProtSent, a contrastive fine-tuning framework that adapts protein language models into general-purpose embedding models using MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB interactions, and deep mutational scanning data. Evaluated on 23 downstream tasks with a frozen k-nearest-neighbor probe, ProtSent on ESM-2 150M improves 15 of 23 tasks, with +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 retrieval. We release the models, code, and training recipe.
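To make the training objective named in the abstract concrete, the following is a minimal sketch of a multiple-negatives ranking loss over in-batch negatives. It is not the released ProtSent training code: the function names are hypothetical, the toy 2-D vectors stand in for pooled protein embeddings, and the cosine similarity with a scale of 20 is an assumption, since the abstract does not state these settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over in-batch candidates.

    positives[i] is the correct match for anchors[i]; every other
    positives[j] in the batch is treated as a negative.
    The scale value is an assumption, not taken from the abstract.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two "proteins" with orthogonal embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is near zero when each anchor is closest to its own positive and grows when the pairs are swapped, which is the behavior that pulls related proteins together in embedding space during fine-tuning.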
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 111