Benchmark Creation for Narrative Knowledge Delta Extraction Tasks: Can LLMs Help?

Published: 2025 · Last Modified: 29 Jan 2026 · ECIR (3) 2025 · CC BY-SA 4.0
Abstract: Keeping up with the state of the art in science is increasingly difficult for researchers due to the current pace of publishing. Inspired by previous work, we address this challenge by formulating the task of Narrative Knowledge Delta (\({\mathcal {N}\mathcal {K}\varDelta }\)) Extraction, which focuses on identifying differences, presented in narrative form, between pairs of scientific articles that tackle the same research problem. We create two manually annotated ground-truth datasets and one automatically generated dataset of \({\mathcal {N}\mathcal {K}\varDelta }\) sentences. Using these datasets, we design and evaluate an approach for extracting \({\mathcal {N}\mathcal {K}\varDelta }\) from pairs of papers with four LLMs: GPT-4o, GPT-4o-mini, Llama3.1-8b, and Llama3.1-70b. We then apply a scientific fact-checking model to evaluate the LLMs' \({\mathcal {N}\mathcal {K}\varDelta }\) output, using the manually annotated data as ground-truth claims. The results show generally improved performance in few-shot settings when examples from the automatically generated data are incorporated. However, our manual analysis reveals challenges and limitations in creating annotated data for evaluating \({\mathcal {N}\mathcal {K}\varDelta }\) extraction by LLMs. Data, prompts, and code are available at: https://github.com/Alaa-Ebshihy/nkd_llm_2024.