Abstract: The application of large language models (LLMs) in the financial domain is increasing, highlighting the need for standardized evaluations. The financial sector contains a vast number of lengthy documents, such as prospectuses, investment research reports, and policy research reports. However, there is currently a lack of effective evaluation datasets and benchmarks for assessing the understanding, analysis, and reasoning capabilities of LLMs on these long documents. To address this issue, we introduce FinLBench, a comprehensive evaluation benchmark designed to assess the ability of LLMs to understand and analyze Chinese financial long documents. FinLBench consists of two key components: the FinLEval dataset and a six-dimensional evaluation framework tailored for LLMs in the financial domain. FinLBench covers six types of long financial documents, twelve sub-tasks, and 3,219 manually annotated question-answer pairs derived from real financial scenarios. Additionally, we conduct extensive experiments with FinLBench on 8 popular commercial LLMs and 2 open-source LLMs. The experimental results indicate that: 1) commercial LLMs outperform open-source LLMs on this benchmark; 2) all LLMs exhibit hallucination issues when evaluated on trap questions. Our empirical findings provide valuable insights for the study of LLMs in the financial domain and lay the foundation for more principled evaluations of these models. The benchmark and dataset will be open-sourced at https://anonymous.4open.science/r/FinLBench-2F95/README.md.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, financial, long-text, benchmark
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 2287