Emergent Pose-Invariance in 3D Molecular Representations via Multimodal Learning

Eduardo Soares; Victor Yukio Shirasuna; Emilio Vital Brazil; Dmitry Zubarev; Enzo Reis de Oliveira; Caio Rodrigues Gama; Daniel Djinishian de Briquez

Published: 20 Sept 2025, Last Modified: 29 Oct 2025, Venue: AI4Mat-NeurIPS-2025 Poster, Readers: Everyone, License: CC BY 4.0

Keywords: Contrastive Learning, 3D electron density grids, SMILES, SO(3) invariance

Abstract: Learning molecular representations that are robust to 3D rotations typically requires architectures with built-in symmetry priors or extensive data augmentation. In this work, we investigate whether contrastive multimodal pretraining alone can induce SO(3) invariance in molecular embeddings. We jointly train a continuous 3D-field encoder, based on a vector-quantized generative adversarial network (VQGAN), and a SMILES-based transformer encoder on a dataset of 855,000 molecules, each represented by a DFT-computed electron density grid and a corresponding canonical SMILES string. Both CLIP-style and SigLIP contrastive objectives are used to align representations across modalities. Because SMILES embeddings are invariant to molecular orientation, the contrastive loss implicitly encourages the 3D encoder to produce rotation-consistent representations by aligning different poses of the same molecule to a fixed symbolic anchor. To evaluate geometric generalization, we construct a benchmark comprising 1,000 molecules with five unseen random SO(3) rotations each. The CLIP-based model retrieves at least one rotated variant among its top-10 results for 77% of queries, compared to 9.8% for a unimodal VQGAN baseline, and retrieves three or more variants for 45% of queries (versus 0% baseline). Functional group-wise Recall@10 exceeds 98% for most chemical classes, and clustering by HOMO energy yields a Davies–Bouldin index of 2.35 (versus 34.46 for the baseline), indicating strong chemical organization in the latent space. Additionally, fine-tuning with rotated samples reveals a trade-off between retrieval precision and pose diversity. These results suggest that contrastive multimodal pretraining can yield symmetry-aware molecular representations, even in the absence of explicit equivariant design.
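
The two contrastive objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature of 0.07, and the SigLIP scale and bias of 10 and -10 are assumptions chosen for illustration, and the inputs stand in for the pooled outputs of the 3D-field encoder and the SMILES encoder. Row i of each matrix is assumed to describe the same molecule.

```python
import numpy as np

def _l2_normalize(x):
    # Unit-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def clip_loss(grid_emb, smiles_emb, temperature=0.07):
    """Symmetric softmax (CLIP-style) loss; temperature is an assumed value."""
    g = _l2_normalize(grid_emb)
    s = _l2_normalize(smiles_emb)
    logits = g @ s.T / temperature           # (batch, batch) similarity matrix
    diag = np.diag(logits)                   # matched grid/SMILES pairs
    loss_g2s = (_logsumexp(logits, axis=1) - diag).mean()  # grid -> SMILES
    loss_s2g = (_logsumexp(logits, axis=0) - diag).mean()  # SMILES -> grid
    return 0.5 * (loss_g2s + loss_s2g)

def siglip_loss(grid_emb, smiles_emb, scale=10.0, bias=-10.0):
    """Pairwise sigmoid (SigLIP-style) loss; scale and bias are assumed values."""
    g = _l2_normalize(grid_emb)
    s = _l2_normalize(smiles_emb)
    logits = scale * (g @ s.T) + bias
    labels = 2.0 * np.eye(len(g)) - 1.0      # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(z)) equals logaddexp(0, -z), which avoids overflow.
    return np.logaddexp(0.0, -labels * logits).sum() / len(g)
```

In both losses the SMILES embedding of a molecule does not depend on how its density grid is oriented, so every rotated grid of that molecule is pulled toward the same target vector. That is the mechanism the abstract credits for the rotation-consistent behaviour of the 3D encoder.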

Submission Track: Paper Track (Full Paper)

Submission Category: Automated Material Characterization

Institution Location: Rio de Janeiro, Brazil; San Jose, United States

AI4Mat Journal Track: Yes

Submission Number: 16
