Towards functional annotation with latent protein language model features

ICML 2025 Workshop FM4LS Submission 63 Authors

Published: 12 Jul 2025, Last Modified: 12 Jul 2025, Venue: FM4LS 2025, License: CC BY 4.0

Keywords: Proteins, computational biology, interpretability, sparse autoencoders

TL;DR: We use protein SAEs to curate features for protein annotation, showing we can identify granular subdomains, find missing database annotations at scale, and structurally match metagenomic proteins.

Abstract: Protein Language Models (PLMs) create high-dimensional embeddings that can be transformed into interpretable features using Sparse Autoencoders (SAEs), where each feature activates on specific protein patterns. However, identifying at scale which features are reliable enough for protein annotation remains challenging. We address this by developing a pipeline that combines three complementary methods: (1) expanded database matching across 20+ annotation sources, (2) feature-guided local structural alignment to identify consistent activation regions, and (3) LLM-based generation of feature descriptions. Our annotation pipeline demonstrates three properties of SAE features that make them a useful source for functional annotation. First, they can represent more granular patterns than existing annotations, enabling the identification of sub-domains. Second, they can detect missing annotations by finding proteins that display recognizable structural motifs but lack corresponding labels; our pipeline identifies at least 491 missing CATH topology annotations. Third, they can maintain structural consistency across unseen proteins. Of our 10,240 SAE features, we find 615 that are structurally similar in unannotated metagenomic proteins, allowing us to match at least 8,077 metagenomic proteins to characterized proteins. This provides a rapid annotation pipeline with constant-time search that automatically includes structural and functional information about the feature that triggered the match.
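To make the constant-time search idea concrete, the sketch below shows one way a feature-keyed lookup could work: per-residue PLM embeddings are passed through a one-layer SAE encoder, features that exceed a threshold are collected, and each active feature is looked up in a precomputed index of characterized proteins. This is a minimal illustration and not the authors' implementation. The encoder form (ReLU of a linear map), the threshold, the toy weights, and every name here (`sae_encode`, `active_features`, `build_index`, `match`, the protein ids) are assumptions made for the example.

```python
from collections import defaultdict


def sae_encode(embedding, w_enc, b_enc):
    """Assumed one-layer SAE encoder: ReLU(W_enc @ x + b_enc) for one residue."""
    pre = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    return [max(0.0, v) for v in pre]


def active_features(residue_embeddings, w_enc, b_enc, threshold=0.5):
    """Return ids of features that exceed `threshold` on at least one residue."""
    active = set()
    for emb in residue_embeddings:
        for feat_id, act in enumerate(sae_encode(emb, w_enc, b_enc)):
            if act > threshold:
                active.add(feat_id)
    return active


def build_index(curated):
    """Map feature id -> characterized protein ids (built once, offline)."""
    index = defaultdict(list)
    for protein_id, feature_ids in curated.items():
        for feat_id in feature_ids:
            index[feat_id].append(protein_id)
    return dict(index)


def match(query_features, index):
    """One dictionary lookup per active feature, independent of database size."""
    return {f: index[f] for f in sorted(query_features) if f in index}


# Toy data (hypothetical): 3 SAE features over 2-dimensional embeddings.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
CURATED = {"P_known_A": [0, 2], "P_known_B": [1]}

index = build_index(CURATED)
query = [[0.9, 0.1], [0.2, 0.0]]  # two residues of an unannotated protein
features = active_features(query, W_ENC, B_ENC)
hits = match(features, index)
```

Because the index is keyed by feature id, each match also records which feature triggered it, so any structural or functional description attached to that feature can be carried over to the query protein.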

Submission Number: 63
