BigMixSolDB: Extraction of a solubility database in solvent mixtures with an uncertainty-quantified large language model-based pipeline

Published: 05 Apr 2026, Last Modified: 06 May 2026, License: CC BY-SA 4.0
Abstract: The rapidly expanding body of chemical literature contains physical data scattered across unstructured text and complex tables in thousands of publications. Translating this information into machine-readable formats is essential for training data-driven chemical models. Because computationally extracted datasets lack uncertainty quantification, manual extraction and curation remain necessary. Here, we address this challenge by presenting an uncertainty-quantified data extraction pipeline. Our fully automated pipeline uses large language models (LLMs) to extract complex tabular and textual chemical data directly from scientific PDFs. We systematically assess several modern LLMs and document-to-text conversion frameworks. We benchmark the uncertainty in the extracted data by comparing our computationally extracted databases with large, existing, manually curated solubility databases. Our pipeline recovers between 79% and 93% of the data contained in the literature, with extraction errors on par with the aleatoric limits of the manually curated databases. We showcase the utility of the optimized pipeline by deploying it across a diverse corpus of 2,793 articles to generate BigMixSolDB, a comprehensive database of solubility in complex mixtures. BigMixSolDB comprises over 325,000 filtered solubility entries spanning single, binary, and ternary solvent systems. Our results demonstrate that integrated LLM-based pipelines can extract literature data at a quality comparable to manual curation. We envision that such frameworks can be used for large-scale, accurate extraction of literature data into datasets for machine learning applications.