A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction

Anowarul Kabir, Asher Moldwin, Amarda Shehu

Published: 2023 (BCB 2023). License: CC BY-SA 4.0.
Abstract: Protein language models based on the transformer architecture are increasingly shown to learn rich representations from protein sequences that improve performance on a variety of downstream protein prediction tasks. These tasks span a wide range of predictions, including secondary structure, subcellular localization, evolutionary relationships within protein families, and superfamily and family membership. There is recent evidence that such models also implicitly learn structural information. In this paper, we put this to the test on a hallmark problem in computational biology, remote homology prediction. We employ a rigorous setting in which, by progressively lowering the sequence identity between training and test proteins, we clarify whether the problem of remote homology prediction has in fact been solved. Among various interesting findings, we report that current state-of-the-art, large models still underperform in the "twilight zone" of very low sequence identity.
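The "twilight zone" evaluation described above hinges on measuring pairwise sequence identity and filtering test pairs below a threshold. The following is a minimal sketch of that filtering step, assuming sequences that have already been aligned (e.g., with gap characters `-`); note that identity conventions vary (denominator may be alignment length, shorter sequence length, etc.), and the threshold of 25% is an illustrative choice, not the paper's exact protocol.

```python
def sequence_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns.

    A column counts as a match only when both residues are identical
    and neither is a gap; columns where both sequences have gaps are
    excluded from the denominator.
    """
    assert len(aligned_a) == len(aligned_b), "sequences must be pre-aligned"
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    aligned_cols = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x != "-" or y != "-"
    )
    return 100.0 * matches / aligned_cols


# Hypothetical toy pairs; real evaluations would use aligned protein
# sequences from a benchmark such as SCOP-derived splits.
pairs = [("MKT-LV", "MKSALV"), ("MKTALV", "MKTALV")]

# Keep only pairs in the low-identity ("twilight zone") regime,
# here illustrated with a 25% cutoff.
twilight = [(a, b) for a, b in pairs if sequence_identity(a, b) < 25.0]
```

Lowering this cutoff makes the benchmark stricter: fewer trivially similar pairs remain, so success increasingly requires the model to capture structural rather than surface-level sequence similarity.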