Prompting or Fine-tuning? Exploring Large Language Models for Causal Graph Validation

Published: 02 Jun 2025, Last Modified: 02 Jun 2025 · Causal-NeSy (Oral) · CC BY 4.0
Keywords: causal graph, causal relation, LLM, fine-tuning, prompting
Abstract: This study explores the capability of Large Language Models (LLMs) to evaluate causality in causal graphs generated by conventional statistical causal discovery methods, a task that traditionally relies on manual assessment by human subject-matter experts. To reduce this reliance on manual assessment, we employ LLMs to evaluate causal relationships by determining whether a causal connection between a pair of variables can be inferred from textual context. Our study compares two approaches: (1) prompting-based zero- and few-shot causal inference (unsupervised), and (2) fine-tuning language models for the causal relation prediction task (supervised). While prompt-based LLMs have demonstrated versatility across various NLP tasks, our experiments on biomedical and general-domain datasets show that fine-tuned models consistently outperform them, achieving up to a 20.5-point improvement in F1 score even when using smaller-parameter language models. These findings provide valuable insights into the strengths and limitations of both approaches for causal graph evaluation.
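To make the prompting-based setup concrete, the sketch below shows one plausible way to frame zero-shot causal graph validation as a yes/no question over a variable pair and its textual context. The prompt wording, function names, and the smoking example are illustrative assumptions, not the authors' exact implementation; any chat-style LLM client could be plugged in to answer the generated prompt.

```python
# Minimal sketch of zero-shot prompting for causal edge validation.
# Assumed setup: an external LLM client returns free-text answers to the prompt.

def build_zero_shot_prompt(cause: str, effect: str, context: str) -> str:
    """Build a prompt asking whether the edge (cause -> effect) is supported by the context."""
    return (
        "You are validating an edge from a causal graph.\n"
        f"Context: {context}\n"
        f"Question: Based only on the context, does '{cause}' cause '{effect}'?\n"
        "Answer with 'yes' or 'no'."
    )

def parse_answer(llm_output: str) -> bool:
    """Map the model's free-text reply to a binary edge-validation label."""
    return llm_output.strip().lower().startswith("yes")

if __name__ == "__main__":
    prompt = build_zero_shot_prompt(
        cause="smoking",
        effect="lung cancer",
        context="Long-term smoking damages lung tissue and is a known risk factor for cancer.",
    )
    print(prompt)  # send this to an LLM of your choice; parse_answer(reply) gives the label
```

A few-shot variant would simply prepend labeled example pairs to the same prompt, while the fine-tuning approach instead trains a (possibly smaller) language model directly on labeled causal relation data.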
Submission Number: 3