Multi-LLM and Multi-Prompt Strategies for COVID-19 Infodemic Detection in Chinese Social Media: An Empirical Evaluation
Keywords: Large Language Models, Infodemic, Social Media
TL;DR: Under optimal configuration, LLMs can assist in detecting COVID-19 misinformation on Chinese social media.
Abstract: \textbf{Objective:} Misinformation during the COVID-19 infodemic poses a serious public health risk. We investigate whether large language models (LLMs) can automatically identify COVID-19 misinformation in Chinese social media content, and how different prompting strategies affect performance.
\textbf{Methods:} We evaluate ten LLMs on 640 physician-verified misinformation posts from a prior mixed-methods study (March 2022–October 2023). Each model issues a five-level verdict (False / Likely-False / Ambiguous / Likely-True / True) under five prompting strategies (no role; public-health expert; respiratory specialist; public-health expert + source/date context; respiratory specialist + source/date context). A single Qwen judge (\texttt{qwen-turbo-latest}) maps each model response to one of the five labels. We report strict accuracy (credit only for False), lenient accuracy (credit for False or Likely-False), ambiguity rate (share of Ambiguous), error rate (share of Likely-True or True), and a composite score.
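The metrics above can be sketched as follows. This is a minimal illustration, assuming each response has already been mapped by the judge to one of the five verdict labels; the composite score's exact formula is not given in the abstract, so it is omitted here.

```python
# Sketch: computing the reported metrics from judge-mapped verdicts on an
# all-misinformation corpus (so "False" is always the correct verdict).
# The composite score is not defined in the abstract and is omitted.
from collections import Counter

def evaluate(verdicts):
    """Return strict/lenient accuracy, ambiguity rate, and error rate
    for a list of judged five-level verdicts."""
    n = len(verdicts)
    counts = Counter(verdicts)
    return {
        "strict_accuracy": counts["False"] / n,
        "lenient_accuracy": (counts["False"] + counts["Likely-False"]) / n,
        "ambiguity_rate": counts["Ambiguous"] / n,
        "error_rate": (counts["Likely-True"] + counts["True"]) / n,
    }

# Hypothetical example with six judged posts
metrics = evaluate(["False", "Likely-False", "False", "Ambiguous", "True", "False"])
```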
\textbf{Results:} Across all experiments, the average lenient accuracy was 61.2\%, with a low overall ambiguity rate (<2\%). Performance was highly model-dependent: the top-performing configuration achieved approximately 90\% lenient accuracy, while more conservative models incorrectly accepted over 50\% of false posts. Counterintuitively, prompting with expert personas and contextual details did not uniformly improve performance and, in many cases, reduced the models' flagging rates.
\textbf{Contributions:} (1) An empirical, multi-LLM, multi-prompt evaluation on a previously established Chinese COVID-19 misinformation corpus. (2) A systematic comparison of five prompt strategies, quantifying how adding source/date context tends to reduce flagging on this all-misinformation benchmark while modestly lowering ambiguity. (3) Evidence that persona choice (public-health vs respiratory specialist) is not uniformly beneficial across posts and prompts. (4) A reproducible release (prompts, code, judging templates, redacted logs) to support Chinese-language infodemic monitoring and future replication.
Supplementary Material: zip
Submission Number: 199