Abstract: With the iterative upgrades of LLMs, their potential for assisting real-world fact-checking has attracted growing interest. However, their effectiveness in detecting misinformation and providing reliable fact-checking explanations has not been thoroughly explored. To address this gap, we propose a comprehensive framework to evaluate and improve LLMs in real-world fact-checking. First, we introduce \texttt{CANDY}, a benchmark with a structured taxonomy specifically designed to evaluate LLMs' performance in misinformation scenarios. Second, we present \texttt{CANDYSET}, a new dataset that enables a detailed evaluation of LLMs' strengths, weaknesses, and risks in fact-checking tasks. Third, leveraging \texttt{CANDY}, we conduct an in-depth analysis to uncover task-specific limitations of LLMs. Our findings indicate that while the inherent deficiencies of current LLMs hinder real-world fact-checking practices, they also highlight the potential for improving task performance through internal optimization. Our work provides a solid foundation for future research. Data samples can be accessed at \url{https://anonymous.4open.science/status/CANDY-7D2E}.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, fact-checking
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 8067