Detecting Symmetry-Breaking in Molecular Data Distributions

Hannah Lawrence; Elyssa Hofgard; Yuxuan Chen; Tess Smidt; Robin Walters

Published: 03 Mar 2025, Last Modified: 09 Apr 2025

Venue: AI4MAT-ICLR-2025 Poster

License: CC BY 4.0

Submission Track: Paper Track (Tiny Paper)

Submission Category: AI-Guided Design

Keywords: equivariance, augmentation, canonicalization, symmetry breaking, distribution shift, classifier test, symmetry

TL;DR: We present a metric for quantifying the degree of distributional symmetry-breaking in materials datasets, which relates to the utility of equivariant models.

Abstract: Equivariant models, which enforce physical symmetries (such as rotations and permutations), have proven very successful at materials science tasks. The usual justification for this success is that symmetry transformations relate data samples, which improves generalization and data efficiency. However, this explanation assumes that transformed versions of a given molecule are highly likely under the data distribution. In this work, we develop a method for testing this assumption by measuring the amount of symmetry in a data distribution. Specifically, we propose a two-sample classifier test which distinguishes between the original dataset and its randomly augmented symmetrization. Unlike existing tests of group invariance, our method does not require defining an appropriate parametric test or kernel. We find that in commonly used materials science datasets such as QM9 and MD17, the orientations of molecules are highly non-uniform. Our findings suggest the success of equivariant models on these datasets may depend on other inductive biases, such as local equivariance. Moreover, non-equivariant models may be strongly benefiting from canonicalization of the molecules’ orientations, an oft-overlooked part of the data generation process. As machine learning becomes increasingly important for materials discovery, it is essential to have tools to critically evaluate the assumptions underlying our data.
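The two-sample classifier test described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: each "molecule" is reduced to a single 2D unit vector, the symmetry group is planar rotations rather than 3D rotations, the classifier is a simple nearest-centroid rule, and the function names (`symmetrize`, `classifier_test_accuracy`) are hypothetical. The idea it shows is the one stated above: train a classifier to tell original samples from randomly rotated ones, and read held-out accuracy as a measure of how far the distribution is from being symmetric.

```python
import math
import random


def rotate(point, angle):
    """Rotate a 2D point counterclockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def symmetrize(dataset, rng):
    """Apply an independent, uniformly random rotation to every sample."""
    return [rotate(p, rng.uniform(0.0, 2.0 * math.pi)) for p in dataset]


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def classifier_test_accuracy(dataset, rng):
    """Held-out accuracy of a nearest-centroid classifier (an illustrative
    stand-in) at separating original samples from symmetrized ones."""
    data = list(dataset)
    rng.shuffle(data)
    half = len(data) // 2
    # Use disjoint halves so the two samples are independent.
    original = data[:half]
    augmented = symmetrize(data[half:], rng)

    # Fit on the first part of each sample, evaluate on the rest.
    n_train = half // 2
    c_orig = centroid(original[:n_train])
    c_aug = centroid(augmented[:n_train])

    def predicts_original(p):
        return math.dist(p, c_orig) < math.dist(p, c_aug)

    test_orig = original[n_train:]
    test_aug = augmented[n_train:]
    correct = sum(predicts_original(p) for p in test_orig)
    correct += sum(not predicts_original(p) for p in test_aug)
    return correct / (len(test_orig) + len(test_aug))


rng = random.Random(0)

# Toy "canonicalized" data: orientations concentrated around angle 0.
canonical = [(math.cos(t), math.sin(t))
             for t in (rng.gauss(0.0, 0.2) for _ in range(4000))]

# Toy rotationally symmetric data: orientations uniform on the circle.
isotropic = [(math.cos(t), math.sin(t))
             for t in (rng.uniform(0.0, 2.0 * math.pi) for _ in range(4000))]

acc_canonical = classifier_test_accuracy(canonical, rng)
acc_isotropic = classifier_test_accuracy(isotropic, rng)
print(f"canonicalized data: accuracy = {acc_canonical:.3f}")
print(f"isotropic data:     accuracy = {acc_isotropic:.3f}")
```

In this sketch, accuracy close to 0.5 means the classifier cannot tell the dataset from its symmetrization, so the toy distribution is approximately rotation-invariant. Accuracy well above 0.5 indicates non-uniform orientations, the situation the abstract reports for QM9 and MD17. A more expressive classifier would be needed to detect symmetry-breaking that does not shift the mean.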

Submission Number: 55
