FinMaster: A Holistic Benchmark for Full-Pipeline Financial Management with Large Language Models

TMLR Paper6390 Authors

05 Nov 2025 (modified: 31 Dec 2025) · Withdrawn by Authors · CC BY 4.0

Abstract: Financial management tasks are pivotal to global economic stability; however, their efficient execution faces persistent challenges, including labor-intensive processes, low error tolerance, data fragmentation, and limitations in existing technological tools. Although large language models (LLMs) have shown remarkable success in various natural language processing (NLP) tasks and have demonstrated potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance suffer from insufficient domain-specific data, simplistic task design, and incomplete evaluation frameworks. To address these gaps, we present \textbf{FinMaster}, a comprehensive financial management benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, \textbf{FinMaster} comprises three main modules: i) \emph{FinSim}, which builds simulators that can generate synthetic, privacy-compliant financial datasets for different types of companies to replicate real-world market dynamics; ii) \emph{FinSuite}, which provides a variety of tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) \emph{FinEval}, which provides a unified framework for streamlined evaluation. Extensive experiments on state-of-the-art LLMs, such as GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-V3, reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90\% on basic tasks to merely 40\% on complex scenarios requiring multi-step reasoning. This degradation reflects the propagation of computational errors: accuracy falls from 58\% on single-metric calculations to 37\% in multi-metric scenarios.
To the best of our knowledge, \textbf{FinMaster} is the first benchmark that comprehensively covers full-pipeline financial workflows with challenging and realistic tasks. We hope that \textbf{FinMaster} can bridge the gap between the research community and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance both efficiency and accuracy.

Submission Type: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=NffGff5MsB

Changes Since Last Submission: This resubmission includes corrections to the references and appendix.

Assigned Action Editor: ~Sachin_Kumar1

Submission Number: 6390
