Language Models as Tools for Research Synthesis and Evaluation

Published: 17 Jun 2024 · Last Modified: 17 Jul 2024 · ICML 2024 AI4Science Poster · CC BY 4.0
Keywords: Large Language Models, Metascience, Research Synthesis, Behavioral Science, Integrative Experiment Design
TL;DR: Experiments with large language models and retrieval-augmented generation show that academic papers can be evaluated by their contribution to predictive accuracy under intervention, sparking various meta-scientific discussions.
Abstract: Is the academic literature building cumulative knowledge that improves our ability to make predictions under interventions? This question touches not only on the internal validity of individual findings, but also on their external validity and on whether science is a cumulative enterprise that generates collectively more accurate representations of the world. Such synthesis and evaluation face significant challenges, especially in the social and behavioral sciences, owing to the complexity of the systems under study and the less structured nature of research outputs. Motivated by these challenges, we propose a novel method that uses large language models (LLMs) and retrieval-augmented generation (RAG) to measure how various sets of academic papers affect the accuracy of predictive models. We elicit LLM predictions of the treatment effect of introducing punishment in public goods games (PGG) across 20 varying dimensions of the game design space, a space that exhibits high heterogeneity. We demonstrate the LLM's ability to retrieve academic papers and to shift its distribution of predictions in the directions expected from the documents' contents. However, we find little evidence that these updates improve the model's predictive accuracy. The framework introduces a method for evaluating the potential contribution and informativeness of scientific literature in prediction tasks, and it also contributes a new human behavior dataset of PGG outcomes, carefully collected using an integrative experiment design, that can serve as a benchmark for LLMs' performance in predicting complex human behavior.
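The evaluation loop the abstract describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy paper corpus, the `game_configs` with observed effects, the `query_llm` stub, and the TF-IDF retrieval (a stand-in for the paper's RAG pipeline) are all hypothetical.

```python
# Minimal sketch of the abstract's evaluation loop: elicit treatment-effect
# predictions with and without retrieved literature, then compare errors.
# All names and data below are illustrative assumptions, not the authors' code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of paper abstracts to retrieve from.
papers = [
    "Punishment sustains cooperation in repeated public goods games.",
    "Costly punishment raises contributions but can reduce net payoffs.",
    "Group size moderates the effect of sanctions on cooperation.",
]

# Hypothetical points in the game design space, each paired with an
# observed treatment effect of introducing punishment.
game_configs = [
    ("repeated PGG, group size 4, high punishment cost", 0.32),
    ("one-shot PGG, group size 8, low punishment cost", 0.10),
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k abstracts most similar to the query (TF-IDF as a RAG stand-in)."""
    vec = TfidfVectorizer().fit(papers + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(papers))[0]
    return [papers[i] for i in sims.argsort()[::-1][:k]]

def query_llm(prompt: str) -> float:
    """Dummy stand-in for an LLM call that elicits a numeric treatment-effect
    prediction; a real model client would go here."""
    return 0.2  # constant placeholder prediction

def mean_abs_error(with_rag: bool) -> float:
    """Average |predicted - observed| treatment effect across configurations."""
    errors = []
    for config, observed in game_configs:
        context = "\n".join(retrieve(config)) if with_rag else ""
        prompt = (
            f"{context}\nPredict the treatment effect of introducing "
            f"punishment in: {config}. Answer with a single number."
        )
        errors.append(abs(query_llm(prompt) - observed))
    return sum(errors) / len(errors)

# Comparing the two conditions operationalizes the paper's question:
# does retrieved literature improve predictive accuracy under intervention?
print("no RAG :", mean_abs_error(with_rag=False))
print("with RAG:", mean_abs_error(with_rag=True))
```

The design point is that retrieval changes only the context given to the model, so any gap between the two error scores can be attributed to the informativeness of the retrieved literature.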
Submission Number: 44