LLM-as-Judge Meets LLM-as-Optimizer: Enhancing Organic Data Extraction Evaluations Through Dual LLM Approaches
Submission Track: Full Paper
Submission Category: AI-Guided Design
Keywords: LLM-as-Judge, LLM-as-Optimizer, data-extraction, LLM evaluation
TL;DR: A dual-LLM framework enhances chemical data extraction: an LLM-as-Judge evaluates extraction quality and an LLM-as-Optimizer refines evaluation prompts, achieving high agreement with expert chemists through systematic analysis of 800+ reaction steps.
Abstract: Large language models (LLMs) show promise for extracting structured data from scientific literature, but their use in chemistry faces unique challenges due to the complex, variable nature of experimental procedures. Here, we present a dual-LLM framework that combines an LLM-as-Judge, which evaluates data extraction quality, with an LLM-as-Optimizer, which systematically refines the evaluation prompts. To evaluate performance, we leverage a manually annotated dataset of over 800 reaction steps in an action-centric schema that captures the sequential nature of chemical procedures rather than the rigid key-value pairs conventionally used. Through systematic analysis of parameters including temperature settings and prompt structures, we identify configurations that maximize agreement with expert chemists while minimizing computational cost. The framework shows good agreement with expert annotations while reducing manual prompt-engineering effort. This approach not only demonstrates how modern machine learning techniques can address fundamental challenges in scientific data extraction but also provides a reusable pipeline for evaluating extraction results across domains where experimental variability has historically limited the development of standardized evaluation metrics.
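The judge/optimizer loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `judge` and `optimizer` are hypothetical stand-ins (in the real system each would be an LLM call), and the toy string-overlap score is an assumption used only to make the loop runnable. The part being illustrated is the structure: the judge scores a candidate evaluation prompt against expert-annotated reaction steps, the optimizer proposes a refined prompt, and the best-scoring prompt is kept.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    prompt: str
    agreement: float  # fraction of annotated reaction steps matched


def judge(prompt: str, annotated_steps: list[tuple[str, str]]) -> float:
    """Hypothetical LLM-as-Judge: scores how well a prompt covers the
    action labels from expert annotations (toy substring check here)."""
    hits = sum(1 for _, action in annotated_steps if action.lower() in prompt.lower())
    return hits / len(annotated_steps)


def optimizer(prompt: str, score: float) -> str:
    """Hypothetical LLM-as-Optimizer: proposes a refined prompt.
    A real system would ask an LLM to rewrite the prompt given its score;
    this stub just appends a fixed refinement hint."""
    return prompt + " Also label wash and dry actions."


def refine(seed_prompt: str, annotated_steps: list[tuple[str, str]], rounds: int = 3) -> EvalResult:
    """Iteratively refine the evaluation prompt, keeping the best scorer."""
    best = EvalResult(seed_prompt, judge(seed_prompt, annotated_steps))
    prompt = seed_prompt
    for _ in range(rounds):
        prompt = optimizer(prompt, best.agreement)
        score = judge(prompt, annotated_steps)
        if score > best.agreement:
            best = EvalResult(prompt, score)
    return best
```

For example, with three annotated steps `[("add NaOH", "add"), ("wash with brine", "wash"), ("dry over MgSO4", "dry")]`, a seed prompt covering only "add" actions scores 1/3, and one optimizer round lifts the agreement to 1.0 under this toy scorer.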
Submission Number: 32