ProtSent: Protein Sentence Transformers

Dan Ofer; Oriel Perets; Michal Linial; Nadav Rappoport

Published: 28 May 2026, Last Modified: 28 May 2026. Venue: ICML 2026 FM4LS Workshop Poster. License: CC BY 4.0

Keywords: Proteins, Sentence Transformers, ESM, Protein Language Models, pLM, sentencebert

TL;DR: Protein sentence transformer models trained using 5 datasets and evaluated on 24 downstream tasks

Abstract: Protein language models produce representations that capture evolutionary and structural information, yet their sequence-level embeddings are not trained to reflect biological similarity between proteins. We present ProtSent, a contrastive fine-tuning framework that adapts protein language models into general-purpose embedding models using MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB interactions, and deep mutational scanning data. Evaluated on 23 downstream tasks with a frozen k-nearest-neighbor probe, ProtSent on ESM-2 150M improves 15 of 23 tasks, with +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 retrieval. We release the models, code, and training recipe.
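
To make the two mechanisms named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows an in-batch-negatives ranking loss in the style of MultipleNegativesRankingLoss, and a k-nearest-neighbor probe over frozen embeddings. The function names, the toy 2-dimensional vectors, the use of cosine similarity, and the `scale` value of 20.0 are all assumptions chosen for illustration; real embeddings would come from a protein language model such as ESM-2.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) pairs.

    positives[i] is the target for anchors[i]; every other positive in
    the batch is treated as a negative. `scale` is an assumed temperature.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def knn_predict(query, train_embeddings, train_labels, k=1):
    """Majority label among the k training embeddings most similar to query.

    The embeddings are never updated here, which is what makes the probe frozen.
    """
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda j: cosine(query, train_embeddings[j]),
        reverse=True,
    )
    top_labels = [train_labels[j] for j in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy batch: each anchor points in the same direction as its positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_shuffled = in_batch_ranking_loss(anchors, shuffled)
predicted = knn_predict([0.9, 0.1], aligned, ["family_a", "family_b"], k=1)
```

In this toy batch the loss is close to zero when each anchor is most similar to its own positive and large when the pairing is shuffled, which is the signal that pulls related proteins together during fine-tuning. The probe then measures embedding quality without training any further parameters.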

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 111
