Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents

ACL ARR 2026 January Submission4267 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Benchmark, Error Detection, Financial Domain
Abstract: Ensuring the accuracy of financial documents is critical for economic analysis, regulatory compliance, and corporate decision-making. Several studies have shown that Large Language Models (LLMs) perform well on many financial tasks, such as stock price movement prediction and financial analytics. However, a critical task remains unexplored: the ability of LLMs to identify errors in financial documents. In this paper, we introduce **FinED-Bench**, the first publicly available **Bench**mark for **Fin**ancial **E**rror **D**etection across three levels of cognitive complexity. FinED-Bench covers nine real-world financial scenarios and includes over 900 documents reported in 2025 that are unseen by existing language models. We detail the benchmark construction process and evaluate several advanced LLMs (e.g., GPT-4o, Qwen3-14B) on this task, which requires both financial domain knowledge and reasoning capabilities. Experimental results show that current LLMs still struggle with this task, especially in high-complexity cases. Moreover, supervised fine-tuning can significantly improve the performance of weaker LLMs on this task. Our data and code are available at https://anonymous.4open.science/r/FinED-Bench-406F.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic creation and evaluation of language resources
Languages Studied: Chinese, English
Submission Number: 4267