FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

ACL ARR 2025 February Submission 7280 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, in long-form question answering (LFQA), they often struggle with factual accuracy, frequently generating hallucinated responses. In this work, we introduce FinLFQA, a benchmark designed to evaluate LLMs' ability to generate long-form answers to financial questions with reliable attributions. FinLFQA evaluates three key aspects of attribution: (1) evidence-supported content to enhance factual grounding and verifiability, (2) step-by-step calculations expressed as executable code for numerical reliability, and (3) reasoning informed by domain-specific financial knowledge. Using the accompanying automated evaluation protocol, we conduct an extensive evaluation of eight LLMs. Our findings show that GPT-4o outperforms the other models, while open-source models are closing the gap with proprietary ones, making them increasingly competitive alternatives for real-world applications. We also find that post-hoc and end-to-end attribution generation perform similarly, and that iterative self-feedback yields no significant improvement unless an external signal is provided.
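To make the second attribution type concrete, the following is a minimal sketch of what a step-by-step calculation attached as executable code might look like. The metric, function name, and all figures are hypothetical illustrations, not drawn from the FinLFQA dataset or the paper's evaluation protocol.

```python
# Hypothetical illustration of an "executable code" attribution: a model's
# numerical claim is backed by a short program that reproduces the figure.

def gross_margin(revenue: float, cost_of_goods_sold: float) -> float:
    """Gross margin = (revenue - COGS) / revenue."""
    return (revenue - cost_of_goods_sold) / revenue

# Attribution attached to the claim "gross margin reached 42.0%":
revenue_2023 = 5_000.0  # hypothetical revenue, in $M
cogs_2023 = 2_900.0     # hypothetical cost of goods sold, in $M
print(f"{gross_margin(revenue_2023, cogs_2023):.1%}")  # -> 42.0%
```

Because the attribution is runnable, a verifier can execute it and check that the printed value matches the number stated in the generated answer.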

Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Languages Studied: English
Submission Number: 7280