Abstract: We introduce FinEvalQA, a new evaluation dataset for assessing the quality of financial-domain question answering (QA) systems. FinEvalQA is built upon two widely used datasets, FiQA and Finance-Alpaca, and includes fine-grained annotations along two dimensions: comprehensiveness and hallucination rate. We propose a question structure-aware generation framework that parses complex financial queries into semantically organized components, helping large language models (LLMs) better focus on the intent and scope of a question during answer generation. Empirical results show that our structured approach substantially reduces the hallucination rate (up to 42.4%) and significantly increases comprehensiveness (up to 75.8%) across models and datasets, highlighting its effectiveness for long-form financial QA.
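The abstract does not spell out the decomposition schema, but a minimal sketch of a structure-aware QA pipeline could look like the following, assuming the query is parsed into intent, scope, and sub-questions before answer generation. All names, the JSON schema, and the `llm` callable are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical schema for the "semantically organized components" mentioned in
# the abstract; the paper's actual field names and granularity may differ.
@dataclass
class StructuredQuery:
    original: str
    intent: str                                              # e.g. "explain", "compare", "recommend"
    scope: List[str] = field(default_factory=list)           # entities / topics in scope
    sub_questions: List[str] = field(default_factory=list)   # decomposed asks

def decompose(question: str, llm: Callable[[str], Dict]) -> StructuredQuery:
    """Ask an LLM to parse the query into intent, scope, and sub-questions."""
    prompt = (
        "Decompose the following financial question into its intent, scope, "
        "and sub-questions, returned as JSON.\n"
        f"Question: {question}"
    )
    parsed = llm(prompt)  # stand-in for an LLM call plus JSON parsing
    return StructuredQuery(
        original=question,
        intent=parsed.get("intent", ""),
        scope=parsed.get("scope", []),
        sub_questions=parsed.get("sub_questions", []),
    )

def generate_answer(sq: StructuredQuery, llm: Callable[[str], str]) -> str:
    """Condition long-form answer generation on the structured components."""
    prompt = (
        f"Intent: {sq.intent}\n"
        f"Scope: {', '.join(sq.scope)}\n"
        "Address each sub-question, then synthesize a complete answer:\n- "
        + "\n- ".join(sq.sub_questions)
    )
    return llm(prompt)
```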
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Financial Question Answering, Large Language Models (LLMs), Dataset, Hierarchical Query Structuring, Comprehensiveness, Hallucination Reduction
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2673