Keywords: molecular tokenization, 3D molecules, coordinate discretization, finite scalar quantization, conformer reconstruction, chemical validity
TL;DR: We introduce CoordToken, a simple discrete tokenizer for 3D small-molecule coordinates that substantially improves reconstruction accuracy while preserving physical plausibility.
Abstract: Atom-level generation of chemical structures, including drug-like molecules, is an increasingly active research direction. Because atomic coordinates are continuous, 3D structure generation has mostly relied on diffusion-style methods, with only a few attempts to use autoregressive models. In this work, we develop CoordToken, a simple recipe for training discrete tokenizers for 3D molecules.
We train CoordToken on two datasets: (i) on $\nabla^2$DFT, where we obtain a reconstruction error of $0.048$Å, a $4.2\times$ reduction compared to a prior method, and (ii) on a large corpus of 196M molecules, where we obtain a micro-averaged RMSD of $0.070$Å across all test datasets, including $0.074$Å on $\nabla^2$DFT. The tokenizer maintains near-perfect physical plausibility, with a 98\% pass rate on the PoseBusters checklist. The tokenizer and the corresponding dataset are available at https://github.com/YerevaNN/CoordToken.
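As an illustration of the reported metric, the following is a minimal sketch of how a micro-averaged RMSD could be computed. It is not taken from the CoordToken code base: the function names are hypothetical, the structures are assumed to be already aligned (no Kabsch superposition is performed), and "micro-averaged" is assumed to mean pooling per-molecule RMSDs over all test datasets before averaging.

```python
import numpy as np


def rmsd(original, reconstructed):
    """RMSD in Å between two (n_atoms, 3) coordinate arrays.

    Assumption: both structures are already in the same frame,
    so no rigid alignment is applied here.
    """
    diff = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    # Squared displacement per atom, averaged over atoms, then square-rooted.
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def micro_averaged_rmsd(per_dataset_rmsds):
    """Pool per-molecule RMSDs from every dataset and average once.

    Assumption: each molecule carries equal weight, so large datasets
    contribute more than small ones.
    """
    pooled = np.concatenate(
        [np.asarray(values, dtype=float) for values in per_dataset_rmsds.values()]
    )
    return float(pooled.mean())


def macro_averaged_rmsd(per_dataset_rmsds):
    """Average the per-dataset means, weighting every dataset equally."""
    return float(np.mean([np.mean(values) for values in per_dataset_rmsds.values()]))
```

Under these assumptions the two averages differ whenever the test datasets have different sizes, which is why a pooled figure can sit below the figure for an individual dataset.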
Submission Number: 87