CoSMis: A Hybrid Human-LLM COVID-Related Scientific Misinformation Dataset and LLM Pipelines for Detecting Scientific Misinformation in the Wild

Published: 13 Jan 2025, Last Modified: 26 Feb 2025, AAAI 2025 PDLM Poster, CC BY 4.0
Keywords: Misinformation in Scientific Reporting, Large Language Models, AI-generated Misinformation, Explainability
Abstract: Automatic detection of misinformation in the scientific domain is challenging because of the distinct writing styles of scientific publications and news reporting. The problem is exacerbated by the prevalence of misinformation generated by large language models (LLMs). In this paper, we address automatic misinformation detection in a more realistic scenario, where the origin of the text (LLM- or human-written) is unknown and explicit claims may not be available. We first introduce a novel labeled dataset, CoSMis, comprising 2,400 scientific news stories sourced from both reliable and unreliable outlets, paired with relevant abstracts from the CORD-19 database. Our dataset uniquely includes both human-written and LLM-generated news articles. We propose a set of dimensions of scientific validity (DoV) along which to evaluate articles for misinformation; these dimensions are incorporated into the prompt structures for the LLMs. We propose three LLM pipelines that compare scientific news to relevant research papers and classify the news for misinformation; the pipelines differ in the amount of intermediate processing applied to the raw news articles and research papers. We apply three prompt engineering strategies (zero-shot, few-shot, and DoV-guided Chain-of-Thought prompting) to these pipelines and evaluate them using GPT-3.5, GPT-4, Llama2-7B/13B/70B, and Llama3-8B.
Submission Number: 37
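To make the DoV-guided classification step concrete, here is a minimal sketch of a zero-shot pipeline that compares a news story against its source abstract along a set of validity dimensions. The paper's exact prompts and DoV set are not reproduced here; the dimension questions, prompt wording, and output format below are illustrative placeholders, and the sketch assumes the OpenAI Python client with an `OPENAI_API_KEY` set in the environment.

```python
# Sketch of a zero-shot, DoV-guided misinformation check.
# The DoV questions and prompt text are hypothetical stand-ins,
# not the paper's actual dimensions or prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical dimensions of scientific validity (DoV);
# the paper defines its own set.
DOV = [
    "Does the article overstate the certainty of the paper's findings?",
    "Does the article misreport the study's methods or sample?",
    "Does the article draw causal claims the paper does not support?",
    "Does the article contradict the paper's stated conclusions?",
]

def classify(news_article: str, paper_abstract: str,
             model: str = "gpt-4") -> str:
    """Check a news story against a source abstract along each DoV
    question and return the model's final verdict line."""
    dims = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(DOV))
    prompt = (
        "You are checking a COVID-related news story against the "
        "research paper it reports on.\n\n"
        f"Research abstract:\n{paper_abstract}\n\n"
        f"News story:\n{news_article}\n\n"
        "Answer each question below, then give a final verdict on the "
        "last line as exactly 'reliable' or 'misinformation':\n"
        f"{dims}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for classification
    )
    # Final verdict is the last non-empty line of the response.
    return resp.choices[0].message.content.strip().splitlines()[-1]
```

The paper's few-shot and DoV-guided Chain-of-Thought variants would extend this same structure, adding labeled example pairs or per-dimension reasoning steps to the prompt before the final verdict.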