Abstract: Protein structure tokenization provides a discrete interface between 3D geometry and
modern learning systems, with applications in reconstruction, retrieval, and generative
modeling. However, existing protein structure tokenizers are not yet accurate enough,
robust enough to structural perturbations, or efficient enough for real-world use, and the field
lacks a fully open, end-to-end method that combines these properties with transparent
reproducibility for the community. In this work, we introduce GCP-VQVAE, a fully open
discrete protein structure tokenizer built around a chirality-aware, SE(3)-equivariant GCPNet
encoder. Our design is motivated by the hypothesis that stronger geometry-aware continuous
representations provide a better substrate for discrete structure tokenization.
Trained on monomer protein backbone structures from the AlphaFold Protein Structure
Database, GCP-VQVAE delivers the strongest reconstruction performance among the open-source
baselines evaluated in this work. For example, it attains 0.5293 Å RMSD on CASP15,
reducing error by 38.5% relative to the strongest prior open baseline (AIDO), and 0.8193
Å RMSD on a zero-shot benchmark of 1,938 newly deposited experimental structures, a
59.2% improvement over the same baseline. In addition, the Large and Lite variants are
approximately 408× and 530× faster than the prior SOTA, respectively, while remaining robust to structural
perturbations such as rigid-body rotations and other input corruptions. To the best of our
knowledge, this is the first protein structure tokenizer to release the full training pipeline,
datasets, model weights, and implementation details, providing a fully transparent and
reproducible foundation for the community to build on.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lijun_Wu1
Submission Number: 9435