UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images, and Videos
Keywords: financial domain, benchmark, multimodal large language model, multimodal QA
Abstract: Multimodal large language models (MLLMs) are playing an increasingly significant role in empowering the financial domain; however, the challenges they face, such as high-density multimodal information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset of 3,767 question-answer pairs in both Chinese and English and systematically evaluate 10 mainstream MLLMs under zero-shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap relative to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs' capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLM applications in real-world financial scenarios. Data and code are available at https://anonymous.4open.science/r/anonym4B75.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: finance, multimodal large language model, multimodal QA, knowledge base QA, logical reasoning QA, open-domain QA
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese, English
Submission Number: 1459