FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

ACL ARR 2026 January Submission533 Authors

23 Dec 2025 (modified: 07 Jun 2026), ACL ARR 2026 January Submission, License: CC BY 4.0
Keywords: benchmarking, NLP datasets, evaluation methodologies, financial/business NLP, chain-of-thought, reasoning
Abstract: Multi-step symbolic reasoning is essential for robust financial analysis, yet current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates final-answer correctness and step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier proprietary LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. The code and data are uploaded to Software and Data for review.
Paper Type: Long
Research Area: Financial Applications and Time Series
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, financial/business NLP, chain-of-thought, reasoning
Contribution Types: Data resources
Languages Studied: English
Submission Number: 533
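
Illustrative sketch: the abstract describes parameterized symbolic templates paired with executable Python traces. The snippet below is a minimal, hypothetical example of that general idea and is not taken from the FinChain code. The function name `compound_interest_template`, the parameter ranges, and the output fields are all assumptions made for illustration. It samples parameters from a seeded generator, renders a question, and records every intermediate step so that a model's reasoning could in principle be checked step by step.

```python
import random


def compound_interest_template(seed):
    """Hypothetical template (not FinChain's actual format).

    Returns the sampled parameters, a rendered question, a step-by-step
    trace, and the final answer for a yearly compound-interest problem.
    """
    rng = random.Random(seed)  # seeded so each instance is reproducible
    principal = rng.randrange(1000, 10001, 500)  # dollars, multiples of 500
    rate_pct = rng.randint(2, 8)                 # annual rate in percent
    years = rng.randint(1, 5)

    question = (
        f"An investor deposits ${principal} at an annual rate of {rate_pct}%, "
        f"compounded yearly. What is the balance after {years} years?"
    )

    # Executable trace: one record per reasoning step.
    steps = []
    balance = float(principal)
    for year in range(1, years + 1):
        interest = round(balance * rate_pct / 100, 2)
        balance = round(balance + interest, 2)
        steps.append({"year": year, "interest": interest, "balance": balance})

    return {
        "params": {"principal": principal, "rate_pct": rate_pct, "years": years},
        "question": question,
        "steps": steps,
        "answer": balance,
    }


instance = compound_interest_template(seed=0)
print(instance["question"])
for step in instance["steps"]:
    print(step)
print("Final answer:", instance["answer"])
```

Because the parameters are drawn fresh from a seed, new instances can be generated on demand, which is one plausible reading of how templated generation avoids overlap with fixed, previously published test sets.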