Exploiting Topology of Protein Language Model Attention Maps for Token Classification

Maria Ivanova; Ilya Trofimov; Pavel Strashnov; Nikita Ivanisenko; Serguei Barannikov; Evgeny Burnaev; Olga Kardymon

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: protein language models, protein property prediction, topological data analysis, attention maps, transformers

TL;DR: We propose using topological data analysis of attention maps in a protein language model for per-amino-acid classification.

Abstract: We introduce a method for extracting topological features from transformer-based protein language models. The method uses the persistent homology of attention maps to generate features for token-level (per-amino-acid) classification tasks, and we demonstrate its relevance in a biological context. We implement it on the ESM-2 family of protein language models. Specifically, we show that minimum spanning trees derived from attention matrices encode structurally significant information about proteins. In our experiments, we combine these topological features with standard ESM-2 embeddings. The combined approach outperforms traditional approaches and other transformer-based methods with a similar number of parameters on several binding site identification tasks, and achieves state-of-the-art performance on conservation prediction tasks. These results highlight the potential of this hybrid approach for advancing the understanding and prediction of protein function.
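
To make the link between attention matrices and minimum spanning trees concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The distance d_ij = 1 - max(a_ij, a_ji), the function names, and the three per-token features (tree degree, lightest and heaviest incident edge) are assumptions chosen for illustration, and the toy 4x4 matrix stands in for a single attention head of a real model. The connection to persistent homology is a standard fact: for a weighted graph, the edge weights of the minimum spanning tree are the death times of the 0-dimensional persistence bars.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Edge = Tuple[int, int, float]


def attention_to_distance(attn: Matrix) -> Matrix:
    """Symmetrize an attention map and convert it to a distance matrix.

    Assumption for this sketch: d_ij = 1 - max(a_ij, a_ji), so strongly
    attending token pairs are close to each other.
    """
    n = len(attn)
    return [
        [0.0 if i == j else 1.0 - max(attn[i][j], attn[j][i]) for j in range(n)]
        for i in range(n)
    ]


def minimum_spanning_tree(dist: Matrix) -> List[Edge]:
    """Prim's algorithm on a dense symmetric distance matrix."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        # Pick the token outside the tree with the cheapest connection.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def per_token_mst_features(attn: Matrix) -> List[Tuple[int, float, float]]:
    """Hypothetical per-token features: (degree, lightest edge, heaviest edge).

    A token with no incident edge (single-token input) keeps lightest = inf.
    """
    n = len(attn)
    edges = minimum_spanning_tree(attention_to_distance(attn))
    degree = [0] * n
    lightest = [float("inf")] * n
    heaviest = [0.0] * n
    for i, j, w in edges:
        for t in (i, j):
            degree[t] += 1
            lightest[t] = min(lightest[t], w)
            heaviest[t] = max(heaviest[t], w)
    return [(degree[t], lightest[t], heaviest[t]) for t in range(n)]


# Toy attention map for a 4-residue sequence (rows sum to 1).
attn = [
    [0.10, 0.80, 0.05, 0.05],
    [0.70, 0.10, 0.15, 0.05],
    [0.10, 0.20, 0.10, 0.60],
    [0.05, 0.05, 0.50, 0.40],
]
features = per_token_mst_features(attn)
```

In a pipeline like the one the abstract describes, features of this kind would be computed per head and layer and concatenated with the per-residue embeddings before classification; which features and heads to use is left open here.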

Supplementary Material: zip

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 11536
