GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure

TMLR Paper9435 Authors

03 Jun 2026 (modified: 05 Jun 2026) | Under review for TMLR | CC BY 4.0
Abstract: Protein structure tokenization provides a discrete interface between 3D geometry and modern learning systems, with applications in reconstruction, retrieval, and generative modeling. However, existing protein structure tokenizers are still not sufficiently accurate, robust to structural perturbations, or efficient for real-world use, and the field still lacks a fully open, end-to-end method that combines these properties with transparent reproducibility for the community. In this work, we introduce GCP-VQVAE, a fully open discrete protein structure tokenizer built around a chirality-aware, SE(3)-equivariant GCPNet encoder. Our design is motivated by the hypothesis that stronger geometry-aware continuous representations provide a better substrate for discrete structure tokenization. Trained on monomer protein backbone structures from the AlphaFold Protein Structure Database, GCP-VQVAE delivers the strongest reconstruction performance among the open-source baselines evaluated in this work. For example, it attains 0.5293 Å RMSD on CASP15, reducing error by 38.5% relative to the strongest prior open baseline (AIDO), and 0.8193 Å RMSD on a zero-shot benchmark of 1,938 newly deposited experimental structures, a 59.2% improvement over the same baseline. In addition, the Large and Lite variants are approximately 408× and 530× faster than the SOTA, respectively, while remaining robust to structural perturbations such as rigid-body rotations and other input corruptions. To the best of our knowledge, this is the first protein structure tokenizer to release the full training pipeline, datasets, model weights, and implementation details, providing a fully transparent and reproducible foundation for the community to build on.
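For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how a backbone RMSD and a relative error reduction are commonly computed. It is not the authors' evaluation code: the function names `kabsch_rmsd` and `relative_reduction` are hypothetical, and the sketch assumes that RMSD is measured after optimal rigid superposition (the Kabsch algorithm), which the abstract does not state.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid
    superposition (Kabsch). Assumption: the abstract does not say
    whether its RMSD values are computed with this alignment."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Cross-covariance matrix and its SVD.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (det = +1) so mirror images are not
    # superimposed; this keeps the metric sensitive to chirality.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Apply the rotation to the rows of Pc and measure the residual.
    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def relative_reduction(baseline, ours):
    """Percentage reduction of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline
```

Under this definition the RMSD is unchanged when either structure is rotated or translated as a rigid body, whereas a mirrored structure yields a nonzero value. A reported reduction of 38.5% corresponds to `relative_reduction(baseline, ours) == 38.5` for the two RMSD values being compared.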
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lijun_Wu1
Submission Number: 9435