Disconnecting The Dots: Creating Leakage-Free Protein Datasets by Removal of Densely Connected Data Points

Charlotte Rochereau; Arthur Valentin; Gergo Nikolenyi; Mohammed AlQuraishi

27 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | CC BY 4.0

Keywords: data splitting, clustering, biology, protein function prediction, protein representations

TL;DR: A novel approach for creating leakage-free protein datasets by sparsely removing central data points.

Abstract: Biological systems arise through evolutionary processes that effectively render all biological data, at scales ranging from biomolecules to organisms, evolutionarily related. This poses a challenge to assessments of model generalization, as naive random splits do not safeguard against data leakage; all data points are in some sense related, and their degree of relatedness lies on a continuum. To address this challenge, various similarity metrics are typically used to cluster data prior to splitting to ensure dissimilarity of resulting partitions. However, as we show in this study, similarity thresholds that lead to well-behaved splits (large numbers of homogeneously sized clusters) must invariably be too permissive, thus only permitting assessment of weak generalization. Conversely, stringent thresholds that could in principle enable assessment of strong generalization typically fail to produce well-separated clusters, yielding one or a handful of very large clusters that span the entire dataset. Here, we propose a new data splitting methodology that optimally balances these competing considerations by relaxing the assumption that all data points must be retained. Instead, through a principled and judicious removal of highly central data points, our approach yields well-behaved data splits that enable assessment of extreme generalization regimes. We demonstrate its utility by investigating the impact of diverse protein representations on protein function prediction. Our experiments confirm the robustness of our new methodology and provide insights into the utility and behavior of protein representations under previously untested regimes of sequence and structure generalization.
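
The general idea of removing central data points so that clusters separate can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a similarity graph in which an edge links any two points judged too similar to sit on opposite sides of a split, uses node degree as a stand-in for "centrality", and stops once no connected component exceeds a chosen size. The function names, the `max_cluster_size` parameter, and the greedy highest-degree rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def connected_components(nodes, adj):
    """Return connected components (sorted lists) of the subgraph induced by `nodes`."""
    seen = set()
    comps = []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        comp = []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps


def prune_central_points(n_points, similar_pairs, max_cluster_size):
    """Greedy sketch (an assumption, not the paper's method): repeatedly drop
    the highest-degree point of the largest component until every component
    has at most `max_cluster_size` points."""
    adj = defaultdict(set)
    for a, b in similar_pairs:
        adj[a].add(b)
        adj[b].add(a)

    kept = set(range(n_points))
    removed = []
    while True:
        comps = connected_components(kept, adj)
        largest = max(comps, key=len, default=[])
        if len(largest) <= max_cluster_size:
            return comps, removed
        # Degree is counted among points still kept; ties go to the lowest index.
        hub = max(largest, key=lambda u: (len(adj[u] & kept), -u))
        kept.remove(hub)
        removed.append(hub)


# Toy example: two triangles bridged by a single hub point (index 3).
pairs = [(0, 1), (1, 2), (0, 2), (4, 5), (5, 6), (4, 6),
         (3, 0), (3, 1), (3, 2), (3, 4), (3, 5), (3, 6)]
clusters, removed = prune_central_points(7, pairs, max_cluster_size=3)
print(clusters, removed)
```

In the toy example, dropping the single bridging point splits one dataset-spanning cluster into two well-separated ones, which can then be assigned to different partitions without any cross-partition similarity edge.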

Supplementary Material: zip

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 12558
