ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions

Yue Huang; Zhengzhe Jiang; Xiaonan Luo; Kehan Guo; Haomin Zhuang; Yujun Zhou; Zhengqing Yuan; Xiaoqi Sun; Jules Schleinitz; Yanbo Wang; Shuhao Zhang; Mihir Surve; Nitesh V Chawla; Olaf Wiest; Xiangliang Zhang

Published: 18 Sept 2025, Last Modified: 29 Oct 2025; Venue: NeurIPS 2025 poster; License: CC BY 4.0

Keywords: Chemistry, Large Language Model, Synthetic Data

Abstract: Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction–response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction–response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and difficulty levels for the generated tasks, and ensures response precision through tool planning and distillation together with tool-based self-repair mechanisms. We evaluate the effectiveness of ChemOrch on three counts: 1) the high quality of the generated instruction data, which show superior diversity and strong alignment with chemical constraints; 2) the dynamic generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs. The code is available at https://anonymous.4open.science/r/ChemOrch-854A.

Supplementary Material: zip

Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)

Submission Number: 6578
