ChemLit-QA: A human evaluated dataset for chemistry RAG tasks

Geemi Wellawatte; Huixuan Guo; Magdalena Lederbauer; Anna Borisova; Matthew Hart; Marta Brucka; Philippe Schwaller

Published: 08 Oct 2024, Last Modified: 03 Nov 2024

Venue: AI4Mat-NeurIPS-2024 Spotlight

License: CC BY 4.0

Submission Track: LLMs for Materials Science - Full Paper

Submission Category: AI-Guided Design

Keywords: Datasets in Chemistry, Large Language Models, RAG, finetuning

Supplementary Material: pdf

TL;DR: ChemLit-QA is a large (1,054 entries), expert-evaluated, open-source dataset of open-ended scientific Question-Answer-Context (QAC) triplets designed specifically for chemistry.

Abstract: Retrieval-Augmented Generation (RAG) is a widely used strategy that lets Large Language Models (LLMs) draw on knowledge beyond what they acquired during pre-training. RAG is therefore crucial in data-sparse fields such as chemistry. RAG systems are commonly evaluated using specialized datasets. However, existing datasets, typically in the form of scientific Question-Answer-Context (QAC) triplets or QA pairs, are often limited in size because manual curation is labor-intensive, or they require further quality assessment when generated through automated processes. This highlights a critical need for large, high-quality datasets tailored to scientific applications. We introduce ChemLit-QA, a comprehensive, expert-validated, open-source dataset comprising over 1,000 entries specifically designed for chemistry. Our approach first generates and filters a QAC dataset using an automated framework based on GPT-4 Turbo, and then subjects it to rigorous evaluation by chemistry experts. Additionally, we provide two supplementary datasets: ChemLit-QA-neg, which focuses on negative data, and ChemLit-QA-multi, which focuses on multi-hop reasoning tasks for LLMs. These complement the main dataset for hallucination detection and more reasoning-intensive tasks.
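
As a rough illustration of the QAC triplet structure and the generate-then-filter idea described in the abstract, the following is a minimal Python sketch. It is not the authors' pipeline or the dataset's actual schema: the class name `QACEntry`, the field `expert_score`, the helper `filter_entries`, the threshold, and the example entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QACEntry:
    """One Question-Answer-Context triplet (field names are hypothetical)."""
    question: str
    answer: str
    context: str
    # Hypothetical quality score assigned during review; None means "not yet scored".
    expert_score: Optional[float] = None


def filter_entries(entries: List[QACEntry], min_score: float) -> List[QACEntry]:
    """Keep only entries that have been scored and meet the threshold."""
    return [
        e for e in entries
        if e.expert_score is not None and e.expert_score >= min_score
    ]


# Invented placeholder entries, not taken from ChemLit-QA.
examples = [
    QACEntry("Which solvent was used?", "Ethanol",
             "The reaction was carried out in ethanol.", expert_score=0.9),
    QACEntry("What was the yield?", "Not stated",
             "The product was isolated as a white solid.", expert_score=0.3),
    QACEntry("Which catalyst was used?", "Palladium on carbon",
             "Hydrogenation used palladium on carbon.", expert_score=None),
]

# Hypothetical threshold of 0.5: only the first entry is scored and passes.
kept = filter_entries(examples, min_score=0.5)
print(len(kept), kept[0].question)
```

Keeping the context alongside each question and answer is what lets a RAG evaluation check whether a generated answer is actually supported by the retrieved passage.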

Submission Number: 49
