Abstract: The application of large language models (LLMs) in the financial domain is increasing, highlighting the need for standardized evaluations. The financial sector contains a vast number of lengthy documents, such as prospectuses, investment research reports, and policy research reports. However, there is currently a lack of effective evaluation datasets and benchmarks for assessing the understanding, analysis, and reasoning capabilities of LLMs on these long documents. To address this issue, we introduce FinLBench, a comprehensive evaluation benchmark designed to assess the ability of LLMs to understand and analyze Chinese financial long documents. FinLBench consists of two key components: the FinLEval dataset and a six-dimensional evaluation framework tailored for LLMs in the financial domain. FinLBench covers six types of long financial documents, twelve sub-tasks, and 3,219 manually annotated question-answer pairs derived from real financial scenarios. Additionally, we conduct extensive experiments with FinLBench on 8 popular commercial LLMs and 2 open-source LLMs. The experimental results indicate that: 1) commercial LLMs outperform open-source LLMs on this benchmark; 2) all LLMs exhibit hallucination issues when evaluated on trap questions. Our empirical findings provide valuable insights for the study of LLMs in the financial domain and lay the foundation for more principled evaluations of these models. The benchmark and dataset will be open-sourced at https://anonymous.4open.science/r/FinLBench-2F95/README.md.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, financial, long-text, benchmark
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 2287