A data guided approach to building an ML ready protein expression dataset

Published: 06 Mar 2025, Last Modified: 26 Apr 2025, GEM, CC BY 4.0
Track: Biology: datasets and/or experimental results
Nature Biotechnology: No
Keywords: protein expression, ML, machine learning, data
TL;DR: This project aims to improve the predictive capabilities of protein expression models through open data generation.
Abstract: Recombinant protein expression is central to academic exploration as well as biotechnology’s advancement of human health, climate applications and the bioeconomy in general. However, not all proteins can be expressed in all organisms, and the field lacks a predictive model of soluble protein expression that could replace laborious experimental trial-and-error. This project aims to design and test an openly available and extensible experimental platform and standardized data ontology for collecting soluble recombinant protein expression data across organisms. The resulting public dataset will be used in building predictive models of protein expression. Here we share preliminary assay feasibility data in our first expression host organism, Escherichia coli.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Aviv_Spinner1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 84