Keywords: Large language models, materials synthesis, retrieval-augmented generation, multi-agent systems, test-time compute, inorganic materials
TL;DR: We benchmark RAG, thermodynamic tools, and multi-agent workflows for LLM-based solid-state synthesis prediction, finding that retrieval of similar recipes outperforms both tool augmentation and test-time compute strategies.
Abstract: Identifying synthesis recipes for new inorganic materials remains a major bottleneck in materials discovery. We investigate whether large language models (LLMs) can improve solid-state synthesis prediction through three augmentation strategies: retrieval-augmented generation (RAG) from the literature, domain-specific thermodynamic tools, and multi-step test-time-compute workflows such as debate, self-reflection, and sequential pipelines. Evaluating on 674 literature-derived targets, we find that retrieving relevant synthesis precedents is the most effective strategy, improving top-10 precursor accuracy from 77.0\% to 83.5\%. Thermodynamic tools also improve performance (80.6\%) but provide little additional benefit when RAG is already used (82.9\% on Gemini 3 Flash, 77.5\% on Gemini 2 Flash). By contrast, test-time compute does not improve performance, and sequential multi-agent workflows often reduce accuracy because errors introduced in earlier stages propagate downstream, causing later steps to mis-rank candidates or overwrite correct answers. Our results show that, for solid-state synthesis prediction, providing models with relevant domain information is more effective than increasing test-time compute through multi-agent deliberation.
Submission Track: Paper Track (Tiny Paper)
Submission Category: Automated Synthesis
Submission Number: 52