BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

ICLR 2026 Conference Submission 5345 Authors

15 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Financial Benchmark, Large Language Models, Domain-Specific Evaluation, Real-World Financial Data
TL;DR: We propose BizFinBench, the first financial benchmark integrating business-oriented tasks, and introduce IteraJudge, a novel method that enhances LLMs' judging capability by refining decision boundaries in financial evaluations.
Abstract: Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 7,605 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We evaluate 30 models, covering both proprietary and open-source systems. The results highlight several key trends: (1) Numerical Calculation: GPT-5 and Gemini-2.5-Pro achieve the best performance, while the open-source DeepSeek-v3.1 demonstrates substantial progress, narrowing the gap with proprietary leaders; (2) Reasoning: proprietary models retain a clear advantage, outperforming open-source counterparts by approximately 10.74%; (3) Information Extraction: DeepSeek-R1 and DeepSeek-V3 deliver competitive results, closely approaching GPT-5 and Gemini-2.5-Pro; (4) Prediction Recognition: reasoning models (e.g., OpenAI o3 and o4-mini) achieve superior performance. Overall, no single model exhibits dominance across all dimensions, underscoring the multifaceted challenges of financial reasoning. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are included in the supplementary material and will be released publicly upon acceptance.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 5345