Challenges and Vision For Standardization of Biopolymer Datasets for Machine Learning

Jessica N. Lalonde; Defne Circi; Babetta L. Marrone; Stefan Zauscher; L. Catherine

Published: 02 Mar 2026, Last Modified: 08 Apr 2026

Venue: AI4Mat-ICLR-2026 Poster

License: CC BY 4.0

Keywords: biopolymers, machine learning, materials informatics, data representation

TL;DR: ML-driven biopolymer discovery is bottlenecked by inadequate materials data representation, quality, and sharing, and requires biopolymer-specific encodings, standardized metadata, and FAIR, community-driven data infrastructure.

Abstract: Machine learning (ML) is transforming materials research, yet its potential for biopolymer discovery remains constrained by fragmented data and non-standardized reporting. Biopolymers differ significantly from synthetic polymers, requiring specialized approaches to represent their biosynthetic origins, hierarchical structures, and application-specific metrics. In this perspective, we identify three core challenges limiting biopolymer representation: information encoding, data quality, and data sharing. Unlike prior reviews on polymer informatics, this perspective focuses explicitly on biopolymer-specific challenges arising from biosynthetic variability, hierarchical structure, and environmental sensitivity, and outlines interoperable, ML-ready solutions tailored to each of the three challenges. Recommendations include the design and adoption of biopolymer-specific fingerprinting frameworks, the development of hybrid data extraction strategies, and the expansion of Findable, Accessible, Interoperable, Reusable (FAIR)-compliant repositories. We propose a robust foundation for defining interoperable, high-quality datasets that capture the full context of biopolymer materials. Standardized metadata, shared ontologies, and community-driven infrastructure will enable scalable, reproducible workflows and accelerate the ML-driven development of biopolymers.

Submission Track: Findings, Tools, & Open Challenges (Tiny Paper)

Submission Category: AI-Guided Design

Submission Number: 19
