Protein Structure Tokenization: Benchmarking and New Recipe

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A comprehensive benchmark for protein structure tokenization methods, focusing on fine-grained local protein structure representation, with a new tokenizer recipe showing enhanced performance
Abstract: Recent years have witnessed a surge in the development of protein structure tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables powerful techniques such as language modeling to be applied directly to protein structures, and allows large multimodal models to integrate structures with protein sequences and functional text. Despite this progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce **StructTokenBench**, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than the global structures typical of existing benchmarks. Our evaluations reveal that no single model dominates across all benchmarking perspectives. Observations of codebook under-utilization led us to develop **AminoAseed**, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively. Source code and model weights are available at https://github.com/KatarinaYuan/StructTokenBench.
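The codebook under-utilization discussed above can be made concrete with a toy example. The sketch below computes nearest-code assignments and the utilization rate for a standard VQ-style codebook; all sizes, dimensions, and data are arbitrary assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Toy sketch: vector-quantization assignment and codebook utilization.
# Codebook size (K), feature dimension (D), and inputs are assumed
# values for illustration only.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 32))   # K codes x D dims (assumed)
features = rng.normal(size=(200, 32))   # toy per-residue structure features

# Assign each feature vector to its nearest code by L2 distance.
d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
codes = d2.argmin(axis=1)

# Utilization: fraction of codebook entries that receive any assignment.
# Under-utilization means many codes are never selected.
utilization = np.unique(codes).size / codebook.shape[0]
print(f"codebook utilization: {utilization:.2%}")
```

A low utilization rate here would mirror the "dead code" problem the paper targets: unused entries receive no gradient signal and waste codebook capacity.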
Lay Summary: Proteins are the tiny machines that keep our bodies alive and functioning. To understand how they work, scientists often study their 3D shapes. But analyzing these complex shapes with AI is not easy — we first need to “translate” them into tokens, much as words are built from letters. This process is called protein structure tokenization. Our research introduces a new benchmark, **StructTokenBench**, to evaluate how well different AI methods perform this translation. We tested many approaches and found that each has its own strengths and weaknesses. Building on these insights, we developed a new method called **AminoAseed**, which improves how effectively AI can learn from protein shapes. This work could help computers better understand proteins, leading to advances in biology, drug discovery, and disease research. Just as language models transformed how machines understand text, our tools aim to do the same for protein structures — unlocking a new era of scientific discovery.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/KatarinaYuan/StructTokenBench
Primary Area: Applications->Health / Medicine
Keywords: protein structure tokenization, benchmarking, VQ-VAE
Submission Number: 11790