Keywords: Chemistry, Large Language Model, Synthetic Data
Abstract: Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction–response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks and ensures response precision through tool planning \& distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1)  the \textbf{high quality} of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2)  the \textbf{dynamic generation of evaluation tasks} that more effectively reveal LLM weaknesses in chemistry; and 3)  the significant \textbf{improvement of LLM chemistry capabilities} when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs. The code is available at \url{https://anonymous.4open.science/r/ChemOrch-854A}.
Supplementary Material:  zip
Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 6578
Loading